Major technology companies, including OpenAI, Google, Meta, and Anthropic, rely on high-quality, copyrighted material from prominent publishers to train their large language models (LLMs).
This is according to a study by Ziff Davis, the parent company of CNET, IGN, and Mashable, which highlights the essential role that high-quality content plays in training these AI models. The study finds that AI companies favor authoritative sources in their training datasets to improve model performance, yet the contribution of these sources often goes unacknowledged.
In the research, Ziff Davis' AI attorney, George Wukoson, and Chief Technology Officer Joey Fortuna claimed that AI companies select training data by favoring authoritative websites with high search engine rankings. Popular, high-quality websites were chosen because of their strong reputations, a strategy that, according to the study, enables AI developers to fine-tune their language models.
Ziff Davis pointed out that content from top-tier providers such as Axel Springer, Future PLC, Hearst, News Corp, and The New York Times, among others, has ended up in training datasets. In particular, the study found that 12.04% of OpenWebText2, an open-source replication of the web-text dataset behind OpenAI's GPT-3, came from these trusted publishers.
Mark Zuckerberg also weighed in on the ongoing debate surrounding content use in AI training. In a recent interview with The Verge, Zuckerberg acknowledged that data scraping for AI is a difficult issue, but he also suggested that the content of any individual creator or publisher might not be that impactful. He stated, “I think individual creators or publishers tend to overestimate the value of their specific content in the grand scheme of this.”
Publishers file lawsuits against AI companies
The secrecy around training data sources has raised concerns among publishers and consumers alike. The New York Times and The Wall Street Journal recently filed lawsuits against AI companies, alleging that the companies violated copyright law by using their content.
While OpenAI has pursued content licensing deals with media organizations such as the Financial Times and DotDash Meredith, several AI firms still operate without proper licensing. The report further states that “major LLM developers no longer disclose their training data as they once did.”
As the valuations of AI companies rise, the gap between technology titans and conventional media companies remains vast. Tech giants such as Google and Meta, with estimated valuations of $2.2 trillion and $1.5 trillion, respectively, remain at the forefront of generative AI, while startups such as OpenAI and Anthropic are valued at $157 billion and $40 billion, respectively.
Publishers, on the other hand, are still dealing with layoffs and restructuring, evidence of the financial pressure of adjusting to an environment increasingly defined by AI. Facing competition from user-generated and AI-generated content, numerous publishers have been forced to cut costs and staff.