
Apple trained AI models on YouTube content without consent [U: MKBHD responds]

A number of tech giants, including Apple, trained AI models on YouTube videos without the consent of the creators, according to a new report today.

They did this by using subtitle files downloaded by a third party from more than 170,000 videos. Creators affected include tech reviewer Marques Brownlee (MKBHD), MrBeast, PewDiePie, Stephen Colbert, John Oliver, and Jimmy Kimmel …

The subtitle files are effectively transcripts of the video content.

Wired reports:

An investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.

Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce.

The downloads were reportedly performed by a non-profit called EleutherAI, which says it helps developers train AI models. While the aim appears to have been to provide training materials to small developers and academics, the dataset has also been used by several tech giants, including Apple.

According to a research paper published by EleutherAI, the dataset is part of a compilation the nonprofit released called the Pile […]

Most of the Pile’s datasets are accessible and open for anyone on the internet with enough space and computing power to access them. Academics and other developers outside of Big Tech made use of the dataset, but they weren’t the only ones.

Apple, Nvidia, and Salesforce—companies valued in the hundreds of billions and trillions of dollars—describe in their research papers and posts how they used the Pile to train AI. Documents also show Apple used the Pile to train OpenELM, a high-profile model released in April, weeks before the company revealed it will add new AI capabilities to iPhones and MacBooks.

Wired says Apple hadn’t responded to a request for comment at the time of writing.

MKBHD subsequently put out a brief response video:

9to5Mac’s Take

Top comment by Inkling


Some think this was a privacy violation. It isn't. The issue is the use of copyrighted content without the authors' permission.

Take YouTube host Marques Brownlee. He earns income from the ads on his tech reviews. Suppose a major TV network were to copy his videos without his permission, broadcast them with ads, and not pay him income from those ads. That would be a flat-out copyright violation. That is covered by copyright law.

The difficulty is that the existing copyright laws, as fixed in the Berne Convention, are based on broadcasting and publishing as they existed in 1971. Computers were mainframes owned by few. There was no Internet, no PCs, no smartphones, no YouTube, and no digital books. AI training has now joined that list. Judges have done their best to extend those early 1970s rules to modern circumstances, but the result is a mess.

There is a principle at play. The copyright for content can be extended to other or "derivative" uses, say a movie or a computer game based on a novel. There the connection is obvious. Anyone who watches the movie The Hunt for Red October knows it is based on a Tom Clancy novel.

The problem comes when the links between the original and the derivative grow distant. AI, training on millions of words, is very distant indeed. Some say that extending copyright to that use goes too far. Some say it doesn't.

Making matters worse, there is no black-and-white law defining what is covered by these over forty years of advances in technology.


It’s important to emphasize here that Apple didn’t download the data itself; that was done by EleutherAI. It is this organization that appears to have broken YouTube’s terms and conditions.

All the same, while Apple and the other companies named likely used a publicly available dataset in good faith, the episode is a good illustration of the legal minefield created by scraping the web to train AI systems. There have been multiple examples of AI systems plagiarizing entire paragraphs of text when asked about niche topics, and the dangers of using material without permission are only increased when companies use datasets compiled by third parties.

We’ve reached out to Apple for comment, and will update with any response.

Screengrab: MKBHD




Author

Ben Lovejoy

Ben Lovejoy is a British technology writer and EU Editor for 9to5Mac. He’s known for his op-eds and diary pieces, exploring his experience of Apple products over time, for a more rounded review. He also writes fiction, with two technothriller novels, a couple of SF shorts and a rom-com!

