OpenAI vs ANI: Why ‘hallucinations’ are unlikely to go away soon
Source: Live Mint
Microsoft-backed OpenAI is no stranger to lawsuits. The soaring reach of its large language model (LLM)-based chatbot, ChatGPT, has continually rattled media houses, artists, and content creators across the world, prompting them to initiate legal action for copyright infringement. They argue that OpenAI is illegally scraping their high-quality and original data to train its models, and profiting from the exercise without remunerating them.
OpenAI faces multiple lawsuits for copyright infringement—13 in the US, two in Canada and one in Germany. However, in a first, an Indian news agency, ANI, has sued OpenAI—not just for copyright infringement but also for allegedly tarnishing its reputation by attributing fabricated news to it. The Delhi high court, which issued a summons to OpenAI on Tuesday, has appointed an amicus curiae (legal term for ‘friend of the court’) to assist in the case, and has scheduled a January hearing.
OpenAI typically counters copyright violation allegations by arguing that there’s no monopoly on facts, and that copyright laws protect only expressions and not facts. In ANI’s case, it added that the news agency had not provided proof of ChatGPT reproducing copyrighted material. The company cited similar lawsuits abroad, none of which has resulted in a finding of copyright violation, and argued that Indian courts lack jurisdiction because its servers are located outside India. Incidentally, copyright violation cases have also been filed by global publishing houses against other tech companies, including Microsoft, Anthropic and Perplexity.
OpenAI, though, is yet to address the issue of fabricating news, and wrongly attributing it to ANI.
LLMs, as we know by now, work by predicting the most probable next word for a given sequence of text, drawing on patterns learned from a vast training corpus. Simply put, they are next-word prediction engines, which also explains why they sometimes “hallucinate”: they throw up false information, or attribute data to the wrong source or to sources that do not exist.
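A toy illustration of that mechanism, and not how ChatGPT is actually built: the sketch below uses a simple bigram counter in place of a neural network. The corpus, the agency names "alpha" and "beta", and the function names are all invented for this example. It shows how a system that only picks the likeliest next word can produce a fluent sentence that appears nowhere in its training data and credits a report to the wrong source.

```python
from collections import Counter, defaultdict

# Invented toy corpus (an assumption for illustration, not real news data).
CORPUS = [
    "alpha reported the flood in delhi",
    "beta reported the merger in london",
    "beta confirmed the merger in london",
]

def train_bigrams(sentences):
    """Count, for each word, how often every other word follows it."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def generate(follows, start, max_words=8):
    """Greedily append the most frequent next word until none is known."""
    words = [start]
    while len(words) < max_words:
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word never had a successor in the corpus
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

model = train_bigrams(CORPUS)
sentence = generate(model, "alpha")
print(sentence)             # alpha reported the merger in london
print(sentence in CORPUS)   # False: "alpha" never reported this story
```

Every word-to-word step in the output was seen in the corpus, yet the sentence as a whole is false: in the toy data, only "beta" covered the merger. Real LLMs condition on far longer contexts, but the same statistical pull toward the most common continuation is one plausible way misattributions arise.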
ANI’s is not an isolated case. On 9 June 2023, Georgia radio host Mark Walters sued OpenAI after discovering that ChatGPT was spreading false information about him. Jeffery Battle, a technologist, sued Microsoft after Bing, which integrates ChatGPT, falsely linked him to a convicted terrorist; he claimed reputational harm from the misattribution.
Newer models have fewer hallucinations, but the problem is unlikely to disappear.
The fact remains that LLMs and LLM-powered chatbots are trained on humongous volumes of data. ChatGPT-3.5, for instance, was trained on 570GB of text data from the internet containing hundreds of billions of words, harvested from books, articles, and websites, including social media. These chatbots then generate new content such as text, images, videos and code in response to natural language prompts. That said, as The New York Times explains in its 69-page suit, the news content in that data is produced by journalists who not only spend considerable time and effort reporting pieces, but also review the articles for accuracy, independence, and fairness.
NYT added in its suit that it depends on its exclusive rights of reproduction, adaptation, publication, performance, and display under copyright law to resist these forces. Simply put, it wants to be paid for the copyrighted content that AI-powered chatbots like ChatGPT and Bing Chat have allegedly been reproducing, verbatim in many cases. It appears to be a fair demand given the loss of advertising and affiliate referral revenue, in addition to the “free-riding on The Times’s significant efforts and investment of human capital to gather this information”.
Avoiding lawsuits
OpenAI, meanwhile, is attempting to avoid potential lawsuits by entering into licensing agreements with many publishers. For instance, it signed a deal this May to access current and archived content from News Corp’s major publications. It has also struck a deal with Conde Nast for content from the publisher’s brands.
Big tech companies like OpenAI and Meta have also won some reprieve in the courts in specific copyright cases, but the courts are yet to address whether the unauthorised use of material scraped from the internet to train AI models amounts to copyright infringement on a massive scale.
And while there are no injunctions on ChatGPT as yet, the outcome of its pending suits will define how foundation models and LLMs are shaped going forward. Until these questions are settled, such models will continue to be hauled over the coals and be viewed with suspicion and mistrust.