
Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos To Train AI (wired.com)

AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube's rules against harvesting materials from the platform without permission. From a report: Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce. The dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI, as did The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live.

Proof News also found material from YouTube megastars, including MrBeast (289 million subscribers, two videos taken for training), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used to train AI also promoted conspiracies such as the "flat-earth theory."
Further reading: YouTube Says OpenAI Training Sora With Its Videos Would Break Rules.

Comments Filter:
  • Copyright (Score:4, Interesting)

    by Freischutz ( 4776131 ) on Tuesday July 16, 2024 @09:54AM (#64629663)
    The hypocrisy of how all the concerns corporations have about copyright and TOS violations evaporate when they need training data is downright amusing. Jane and Joe Six-pack would get the entire law library thrown at them for doing this.
    • Eh, I think Joe Sixpack could pirate himself a few $60 games and not even get a slap on the wrist.

      • Joe Sixpack would maybe play these games, for a while. Aforementioned corporations violate copyrights to SELL you their services. Huge difference.
        • by Hadlock ( 143607 )

          If you watch a video on how to do an oil change for your car, then charge your neighbor $20 to change their oil, do you owe the video creator royalties? I don't see why showing the video to a computer is any different.

    • The hypocrisy of how all the concerns corporations have about copyright and TOS violations evaporate when they need training data is downright amusing. Jane and Joe Six-pack would get the entire law library thrown at them for doing this.

      Did they download the video? Did they comply with the YouTube download policy? Maybe copyright could be an issue here.

      Did the AI "watch" a normally streamed video on YouTube? That would not seem to be a copyright issue. Isn't learning something from a YouTube video something that is allowed? I learned how to replace a broken plastic lens on my car. It seemed quite apparent that was the intent of the video. Is there something in the policies that only humans may learn something, that AI must not?

      In ant

  • AI companies are generally secretive about their sources of training data

    That's because their training data is mostly stolen property. Like any half-clever thief, they keep quiet about their loot.

    • by Rei ( 128717 ) on Tuesday July 16, 2024 @10:12AM (#64629717) Homepage

      Huh, they STOLE the subtitles? That's terrible! Now all of those channel owners have to recreate their subtitles from scratch because they no longer have them, right?

      • Great strawman.

        These creators and their distributor were flat robbed of their content. Don't use weasel words. Call it what it is: Theft on a grand scale, a mockery of copyright and intellectual property laws.

        The models trained on this are now permanently and inherently spoiled stolen goods. "Move fast and break things" creates vandals and thieves.

        • by Rei ( 128717 )

          These creators and their distributor were flat robbed of their content.

          So you're confirming that they no longer have it! OMG! Those poor people, having to recreate all their subtitles from scratch :(

          • Once again, an unauthorized use of their content constitutes IP theft. There is no inherent corporate right to lift materials at random, then put them into the garbage disposal of AI training.

            YouTube et al. must defend against this theft, because it circumvents the protections of their ToS.

            It's theft, pure and simple. It's grifting the IP of others without compensation or permission.

            • by Rei ( 128717 ) on Tuesday July 16, 2024 @11:37AM (#64629969) Homepage

              The term you're looking for is "piracy", not "theft". Stop misusing words. And even piracy only applies in cases where it's not fair use. E.g. you can't sue Google for spidering your website to build a search engine they're going to profit off of. They have every right to download copyrighted information off the internet, store it on their servers, and use it to make new products.

              For a more extreme case, see Authors Guild, Inc. v. Google, Inc. The Authors Guild explicitly told Google not to scan their books. Google ignored them and scanned the books en masse, and posted excerpts of paragraphs or even whole pages verbatim online, without permission. The Authors Guild sued. They lost.

              Copyright isn't a dictatorship. Fair use exists.

              • Not comparable. From YouTube's ToS ...
                Permissions and Restrictions
                You may access and use the Service as made available to you, as long as you comply with this Agreement and applicable law. You may view or listen to Content for your personal, non-commercial use. You may also show YouTube videos through the embeddable YouTube player.

                The following restrictions apply to your use of the Service. You are not allowed to:

                access, reproduce, download, distribute, transmit, broadcast, display, sell, license, alter, mo

                • by Rei ( 128717 )

                  And violating a TOS is neither theft nor piracy.

                  You seem to hate language to the point that I have to wonder if language killed your parents or something.

            • by drnb ( 2434720 )

              There is no inherent corporate right to lift materials at random, then put them into the garbage disposal of AI training.

              There is a right for a person to view videos at random and perhaps learn something from them. Are AIs prohibited from streaming and learning something? Is that in the YouTube terms of use?

              YouTube et al. must defend against this theft, ...

              Are the AIs rebroadcasting the video, or modified, derived versions of it?

              If a human learns something from a video, are they prohibited from sharing that knowledge in their own words?

              Are they prohibited from putting that knowledge to commercial use?

              • You speak of humans; they're not involved here except as the hand on the vacuum cleaner, which then manifests what it has vacuumed into another form while retaining the original.

                Mincing words to apply humanity to crawling, spidering, and other forms of wholesale ToS and creator-rights violations doesn't change the nature of the word "theft".

                • You are missing the point. If a human can learn something from a YouTube video and it is not infringement or some other sort of illegal act, then the same would be true for an AI.

                  Where human vs. AI is relevant is in the YouTube terms of service. The terms may prohibit bots, since YouTube wants to stream only to a live human.

                  The only potential "theft" here is theft of a service, the streaming video, not the theft of any information.
                  • Analogously, if a human reads a book, that's what it's there for: reading. If that human makes and distributes many copies of that copyrighted book, then it's piracy of that book and infringement of the intellectual property rights endowed to the rights-holder, and the rights-holder has had their work stolen.

                    If AI training produces an LLM that can provably regurgitate content the trainers don't own the rights to, it's the same theft of IP.

                    • by drnb ( 2434720 )

                      Analogously, if a human reads a book, that's what it's there for: reading. If that human makes and distributes many copies of that copyrighted book, then it's piracy of that book and infringement of the intellectual property rights endowed to the rights-holder, and the rights-holder has had their work stolen.

                      If AI training produces an LLM that can provably regurgitate content the trainers don't own the rights to, it's the same theft of IP.

                      Agreed wrt regurgitation. However, if the AI is learning as the human reader did, where is the problem? Like the book, the video is presented to be understood and learned from.

                    • A book reader gets as many views as they want to reread; it's their book.

                      YouTube content is served to one or two people at a time. YouTube prefers 1:1 but doesn't care about, say, a sports bar.

                      A single crawler digests a site. The content is repurposed and re-manifested on demand by queries, one to many, indiscriminately.

                      The content creator's goals: getting paid by views/likes/channel members/etc., whatever. More is better, with more revenue at various thresholds by volume.

                      LLM reuse: no control, no revenue, no reimbursement for the con

                    • by drnb ( 2434720 )
                      I'm not sure where most of that is in conflict with what I wrote, whether complete or digest. An AI regurgitating content would be a copyright infringement. YouTube's terms of use may only allow human users, for various reasons which may include costs and revenue.

                      Where I see the fair use is in learning: acquiring knowledge for the sake of application. Neither YouTube nor video authors have any claim on my commercial application of general knowledge I gained from them. Why would it be different for an AI? In this
            • Once again, an unauthorized use of their content constitutes IP theft.

              Citation? There literally is no crime called that; even places that use the term fall back on the existing definition of "copyright infringement," which is not theft, it's copyright infringement. MGM's lawyer got chewed out during MGM v. Grokster, for instance, BECAUSE he kept asserting that using a term a certain way automatically made it relevant/applicable (when it doesn't, as the judge who chewed him out elaborated).

        • Call it what it is: Theft on a grand scale, a mockery of copyright and intellectual property laws.

          Sooo, The Pirate Bay?
          • Different theory of ripoff; both are unauthorized uses.

            The Pirate Bay makes no claim to being a Fair Use source.

            You might think that Apple and others would respect IP, but that thought is in the ditch. The big straw of IP theft to improve LLMs and AI training has become beyond abusive, perhaps criminal.

            But the courts are in disarray, a handy fact for tech titans.

        • First you said it was stolen, which it wasn't.

          Now you're saying they were robbed, which implies violence or threat, which didn't happen even more.

          You actually went from incorrect to even less correct there.

          Stop glorifying modern copyright. It is a cash grab that interferes with our rights, nothing more.

        • Don't use weasel words

          You're literally misusing terms with legal meaning ... in a post discussing an article regarding legal claims/allegations, fuck off with that.

      • by necro81 ( 917438 )

        Huh, they STOLE the subtitles? That's terrible! Now all of those channel owners have to recreate their subtitles from scratch because they no longer have them, right?

        Most subtitles are auto-generated by YouTube anyway, and are of pretty mediocre quality. If anything, it's likely to cause problems for LLM training: you're feeding the output of YouTube's speech-to-text engine back into a model. Models trained on models tend to not do so well.
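As an aside on that last point, here is a minimal, self-contained sketch of how one might screen out transcripts that look like auto-generated captions before using them as training data. The heuristic and thresholds are made up for illustration; they are not from the article or any known pipeline.

```python
# Rough heuristic (illustrative only): YouTube's auto-generated captions
# usually arrive as lowercase text with little or no punctuation, while
# human-written subtitles tend to have sentence punctuation and mixed case.

def looks_auto_generated(transcript: str) -> bool:
    words = transcript.split()
    if not words:
        return True
    punctuation = sum(transcript.count(c) for c in ".?!,")
    capitalized = sum(1 for w in words if w[0].isupper())
    # Flag transcripts with almost no punctuation or capital letters.
    return punctuation / len(words) < 0.01 and capitalized / len(words) < 0.05


def filter_corpus(transcripts):
    """Keep only transcripts that appear to be human-written."""
    return [t for t in transcripts if not looks_auto_generated(t)]


if __name__ == "__main__":
    samples = [
        "so today were going to talk about thermodynamics and entropy",
        "Today we'll cover the First Law of Thermodynamics. Ready? Let's go!",
    ]
    print(filter_corpus(samples))  # only the second sample survives
```

If the source metadata distinguishes uploaded subtitle tracks from speech-to-text output, filtering on that flag would of course be more reliable than any text heuristic.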

  • by Roogna ( 9643 ) on Tuesday July 16, 2024 @10:18AM (#64629731)

    I imagine Google simply made a fair chunk of money off of all this (or, more likely in Apple's case, paid Apple slightly "less" on paper for Safari's default search engine spot). After all, why "steal" when one can simply avoid TOS issues by exchanging a few words in a contract someplace?

    Now the fact that users still upload stuff to Google, giving it the rights to do mostly whatever it wants with the uploads, is an entirely different issue.

  • to train AI also promoted conspiracies such as the "flat-earth theory.

    That's the least of our fake-news problems. Flatties are mostly harmless morons, contrasted with, say, people who punch trans people over "grooming" rumors, or who overthrow democracy because "ballots were rigged."

  • by laird ( 2705 ) <lairdp@gm a i l.com> on Tuesday July 16, 2024 @10:44AM (#64629805) Journal

    The language is misleading. Nothing was taken or stolen: they read the transcripts of the videos and learned from them; they didn't take the videos away from YouTube, and the videos were all public.

    • by necro81 ( 917438 )

      ...and the videos were all public.

      I'm not sure that "we'll post videos for billions of individuals to watch one at a time in a monetized fashion" is quite the same as "I'm going to have my supercomputer suck up all the content in the world, then create an infinity of derivative works and charge everyone for that."

    • Read? Colloquially, computers could be said to read, but in this context it is clearly misleading.

      Computers do not process digital data without making an identical copy. Even in some theoretical extreme where the data is transformed and discarded after every bit, a literal copy is inherent. In practice, the entire training set is literally copied to long-term storage.
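To make that point concrete, here is a tiny self-contained sketch (a toy stand-in, not anything from the article): even when data is "streamed" and handled a chunk at a time, each chunk is a byte-for-byte copy sitting in the consumer's memory, and the full text can be reassembled trivially; a real training pipeline would normally also persist it to disk.

```python
import io

# A toy "stream" standing in for a subtitle download.
source = io.BytesIO(b"subtitle text pulled from some video")

chunks = []
while True:
    chunk = source.read(8)      # process 8 bytes at a time
    if not chunk:
        break
    chunks.append(chunk)        # each chunk is already a literal copy in RAM

reassembled = b"".join(chunks)  # the identical copy, rebuilt from the pieces
print(reassembled == source.getvalue())  # True
```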

      • by laird ( 2705 )

        Some copyright holders tried to make that argument, suing people who read content online because copies were made in routers, in modems, in computer memory, on screen, etc. They lost those lawsuits, because copyright prohibits distributing copies.

    • The language is misleading. Nothing was taken or stolen: they read the transcripts of the videos and learned from them; they didn't take the videos away from YouTube, and the videos were all public.

      Weren't the transcripts AI generated?

  • God help us all if they're training AI with the comments.

  • someone actually makes a profit from their internet-trained LLM. The specialty AI models trained on specific data sets are probably fine. But all of the LLM models trained on "teh interweb" seem to be flagrantly built on data theft. As soon as any of those starts making real money, I would imagine that it'll be lawsuit armageddon.

    But, currently those AI models are massive loss leaders. They're basically being fueled by shovelling suitcases of hundred dollar bills into a boiler. No lawyer is going to gi
  • If you've used gpt4o to act as a translator, you'll no doubt have heard it default to repeating "thanks for watching this video, don't forget to like and subscribe" over and over instead of translating. It's pretty well documented on Reddit and elsewhere. It's very obvious they trained it on YouTube content for it to have learnt that this is how it should respond.
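For what it's worth, that failure mode is easy to screen for after the fact. Here is a minimal sketch (the phrase list and the check are made up for illustration, not any actual OpenAI tooling) that flags translation outputs which look like YouTube outro boilerplate rather than a translation:

```python
# Hypothetical post-check for a translation pipeline: flag outputs that look
# like YouTube outro boilerplate instead of an actual translation.
OUTRO_PHRASES = [
    "thanks for watching",
    "like and subscribe",
    "don't forget to subscribe",
    "see you in the next video",
]

def looks_like_outro(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in OUTRO_PHRASES)

outputs = [
    "Thanks for watching this video, don't forget to like and subscribe!",
    "The train to Kyoto leaves from platform 3 at 14:05.",
]
for out in outputs:
    print(("SUSPECT" if looks_like_outro(out) else "OK") + ": " + out)
```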
  • I mean, they're pirating... err, training on the good/useful ones?
