Where does AI training data come from?
A report from The New York Times published on Friday revealed that OpenAI may have trained its AI models on YouTube video transcripts, and that Google may have done the same.
According to the report, in search of fresh digital data to train its newest and most capable AI system, OpenAI researchers built a speech-recognition tool called Whisper that could take YouTube videos and transcribe them into text, which could then be fed in as new training data for a more conversational, next-generation AI.
GPT-4, the powerful AI model behind OpenAI's latest version of ChatGPT, was trained on more than one million hours of YouTube videos transcribed by Whisper, according to The New York Times' sources.
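For readers curious what that transcription step looks like in practice, here is a minimal, hypothetical sketch using the open-source Whisper package OpenAI has released. It is not the company's actual training pipeline, and the audio file name is an assumption for illustration.

```python
# Illustrative sketch only, not OpenAI's internal pipeline: turning audio into
# plain text with the open-source Whisper package, mirroring the speech-to-text
# step described in the report. "downloaded_video.mp3" is a hypothetical file.
import whisper

model = whisper.load_model("base")                 # load a pretrained Whisper checkpoint
result = model.transcribe("downloaded_video.mp3")  # run speech recognition on the audio
print(result["text"])                              # the plain-text transcript
```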
Related: OpenAI is holding back on the release of its new AI voice generator
The Times reports that OpenAI employees discussed how transcribing YouTube videos for training data might violate YouTube's rules, but OpenAI decided to move forward anyway in the belief that training the AI on the videos was fair use.
Knowledge of where the training data came from extended to senior leadership, according to The Times, with OpenAI president Greg Brockman allegedly helping to collect the videos.
The Wall Street Journal's Joanna Stern interviewed OpenAI CTO Mira Murati last month and asked her what data was used to train one of OpenAI's latest products: Sora, a tool that generates video from text prompts.
Related: Authors are suing OpenAI because ChatGPT is too 'accurate'
“We used public and licensed data,” said Murati. When Stern asked, “So YouTube videos?” Murati replied, “Actually, I'm not sure about that.”
When Stern pressed further, asking, “Videos from Facebook, Instagram?” Murati said, “You know, if they were publicly available, publicly available to use, there might be data, but I'm not sure. I'm not sure about that.”
YouTube CEO Neal Mohan said last week that if OpenAI used YouTube videos to train Sora, it would be a “clear violation” of YouTube's terms of use.
The terms of service “do not allow things like transcripts or video clips to be downloaded,” Mohan told Emily Chang, host of Bloomberg Originals.
However, five sources told The Times that Google did the same thing as OpenAI, transcribing YouTube videos to generate new training text for its AI models, in a possible violation of copyright law.
Google, which owns YouTube, told The Times that its AI is “trained on certain YouTube content” as permitted by its agreements with creators.
Lawsuits over training AI on copyrighted material have become widespread in recent years, with authors like Paul Tremblay and Sarah Silverman claiming their books were included in the datasets used to train AI models without their consent.
The attorneys behind these lawsuits, Joseph Saveri and Matthew Butterick, state on their website that generative AI is simply “human intelligence, repackaged and shared by its creators.”
More than 15,000 authors signed a letter last year asking the CEOs of big tech companies, including OpenAI, Google, Microsoft, Meta, and IBM, to obtain writers' consent before training AI on their work and to credit and compensate them.
It's not just authors: musicians are feeling the impact of AI, too. Artists such as Billie Eilish and Jon Bon Jovi signed an open letter last week accusing major tech companies of using their work to train models without permission or compensation.
“These efforts aim to replace the work of human artists with massive amounts of AI-generated 'sounds' and 'images' that significantly dilute the royalties paid to artists,” the letter declared.
Last month, Tennessee became the first state to pass legislation protecting artists from “deepfakes,” or cloned and manipulated versions of their voices.
Related: Tennessee just passed a new law to protect musicians from a growing AI threat