Apple, NVIDIA and Anthropic reportedly used YouTube transcripts without permission to train AI models

A recent investigation has revealed that some of the world’s largest tech companies trained their AI models, without permission, on a dataset containing transcripts of more than 173,000 YouTube videos. This dataset, created by the nonprofit EleutherAI, includes transcripts from more than 48,000 channels and has been used by companies such as Apple, NVIDIA, and Anthropic. The findings highlight a concerning issue in AI development: the technology often relies on data taken from creators without their consent or compensation.

While the dataset does not include any actual videos or images, it contains transcripts from prominent YouTube creators like Marques Brownlee and MrBeast, as well as major news outlets such as The New York Times, BBC, and ABC News. Transcripts from videos by Engadget are also included.

Marques Brownlee commented on X, “Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine. This is going to be an evolving problem for a long time.”

A Google spokesperson reiterated YouTube CEO Neal Mohan’s stance that using YouTube data to train AI models violates the platform’s terms of service. Apple, NVIDIA, Anthropic, and EleutherAI did not respond to requests for comment from Engadget.

AI companies have been criticized for their lack of transparency regarding the data used to train their models. Earlier this month, artists and photographers criticized Apple for not disclosing the sources of training data for Apple Intelligence, its new generative AI feature.

YouTube, the world’s largest video repository, is particularly valuable for AI training due to its vast collection of transcripts, audio, video, and images. Earlier this year, OpenAI’s CTO, Mira Murati, avoided specifying whether YouTube videos were used to train Sora, the company’s upcoming AI video generation tool, stating only that the data was publicly available or licensed. Alphabet CEO Sundar Pichai has also affirmed that using YouTube data for AI training violates the platform’s terms of service.
