A recent investigation has revealed that some of the world’s largest tech companies trained their AI models, without permission, on a dataset containing transcripts of more than 173,000 YouTube videos. The dataset, created by the nonprofit EleutherAI, includes transcripts from more than 48,000 channels and has been used by companies such as Apple, NVIDIA, and Anthropic. The findings highlight a concerning issue in AI development: the technology often relies on data taken from creators without their consent or compensation.
While the dataset does not include any actual videos or images, it contains transcripts from prominent YouTube creators like Marques Brownlee and MrBeast, as well as major news outlets such as The New York Times, BBC, and ABC News. Transcripts from videos by Engadget are also included.
Marques Brownlee commented on X, “Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine. This is going to be an evolving problem for a long time.”
A Google spokesperson reiterated YouTube CEO Neal Mohan’s stance that using YouTube data to train AI models violates the platform’s terms of service. Apple, NVIDIA, Anthropic, and EleutherAI did not respond to requests for comment from Engadget.
AI companies have been criticized for their lack of transparency regarding the data used to train their models. Earlier this month, artists and photographers criticized Apple for not disclosing the sources of training data for Apple Intelligence, its new generative AI feature.
YouTube, the world’s largest video repository, is particularly valuable for AI training because of its vast collection of transcripts, audio, video, and images. Earlier this year, OpenAI’s CTO, Mira Murati, declined to say whether YouTube videos were used to train Sora, the company’s upcoming AI video generation tool, stating only that the data was publicly available or licensed. Alphabet CEO Sundar Pichai has also affirmed that using YouTube data for AI training violates the platform’s terms of service.