Open-source AI must reveal its training data, per new OSI definition

Source: The Verge

The Open Source Initiative (OSI) has released its official definition of “open” artificial intelligence, setting the stage for a clash with tech giants like Meta — whose models don’t fit the rules.

OSI has long set the industry standard for what constitutes open-source software, but AI systems include elements that aren’t covered by conventional licenses, like model training data. Now, for an AI system to be considered truly open source, it must provide:

  • Access to details about the data used to train the AI so others can understand and re-create it
  • The complete code used to build and run the AI
  • The settings and weights from the training, which help the AI produce its results

This definition directly challenges Meta's Llama, widely promoted as the largest open-source AI model. Llama is publicly available for download and use, but its license restricts commercial use for applications with over 700 million users and does not provide access to training data, so it falls short of OSI's standard of unrestricted freedom to use, modify, and share.

Meta spokesperson Faith Eischen told The Verge that while “we agree with our partner OSI on many things,” the company disagrees with this definition. “There is no single open source AI definition, and defining it is a challenge because previous open source definitions do not encompass the complexities of today’s rapidly advancing AI models.”

“We will continue working with OSI and other industry groups to make AI more accessible and free responsibly, regardless of technical definitions,” Eischen added.

For 25 years, OSI’s definition of open-source software has been widely accepted by developers who want to build on each other’s work without fear of lawsuits or licensing traps. Now, as AI reshapes the landscape, tech giants face a pivotal choice: embrace these established principles or reject them. The Linux Foundation has also made a recent attempt to define “open-source AI,” signaling a growing debate over how traditional open-source values will adapt to the AI era.

“Now that we have a robust definition in place maybe we can push back more aggressively against companies who are ‘open washing’ and declaring their work open source when it actually isn’t,” Simon Willison, an independent researcher and creator of the open-source multi-tool Datasette, told The Verge.

Hugging Face CEO Clément Delangue called OSI’s definition “a huge help in shaping the conversation around openness in AI, especially when it comes to the crucial role of training data.”

OSI executive director Stefano Maffulli says the initiative spent two years refining this definition through a collaborative, global process, consulting machine learning and natural language processing experts from academia, philosophers, content creators from the Creative Commons world, and more.

While Meta cites safety concerns for restricting access to its training data, critics see a simpler motive: minimizing its legal liability and safeguarding its competitive advantage. Many AI models are almost certainly trained on copyrighted material; in April, The New York Times reported that Meta internally acknowledged there was copyrighted content in its training data “because we have no way of not collecting that.” There’s a litany of lawsuits against Meta, OpenAI, Perplexity, Anthropic, and others for alleged infringement. But with rare exceptions — like Stable Diffusion, which reveals its training data — plaintiffs must currently rely on circumstantial evidence to demonstrate that their work has been scraped.

Meanwhile, Maffulli sees open-source history repeating itself. “Meta is making the same arguments” as Microsoft did in the 1990s when it saw open source as a threat to its business model, Maffulli told The Verge. He recalls Meta telling him about its intensive investment in Llama, asking him “who do you think is going to be able to do the same thing?” Maffulli saw a familiar pattern: a tech giant using cost and complexity to justify keeping its technology locked away. “We come back to the early days,” he said.

“That’s their secret sauce,” Maffulli said of the training data. “It’s the valuable IP.”


