The open-source AI debate: Why selective transparency poses a serious risk

Source: VentureBeat

As tech giants declare their AI releases open — and even put the word in their names — the once insider term “open source” has burst into the modern zeitgeist. During this precarious time in which one company’s misstep could set back the public’s comfort with AI by a decade or more, the concepts of openness and transparency are being wielded haphazardly, and sometimes dishonestly, to breed trust. 

At the same time, with the new White House administration taking a more hands-off approach to tech regulation, the battle lines have been drawn — pitting innovation against regulation and predicting dire consequences if the “wrong” side prevails. 

There is, however, a third way that has been tested and proven through other waves of technological change. Grounded in the principles of openness and transparency, true open source collaboration unlocks faster rates of innovation even as it empowers the industry to develop technology that is unbiased, ethical and beneficial to society. 

Understanding the power of true open source collaboration

Put simply, open-source software features freely available source code that can be viewed, modified, dissected, adopted and shared for commercial and noncommercial purposes, and historically it has been monumental in driving innovation. Open-source offerings such as Linux, Apache, MySQL and PHP, for example, unleashed the internet as we know it. 

Now, by democratizing access to AI models, data, parameters and open-source AI tools, the community can once again unleash faster innovation instead of continually reinventing the wheel — which is why a recent IBM study of 2,400 IT decision-makers revealed a growing interest in using open-source AI tools to drive ROI. While faster development and innovation topped the list of factors determining ROI in AI, the research also confirmed that embracing open solutions may correlate with greater financial viability.

Instead of short-term gains that favor fewer companies, open-source AI invites the creation of more diverse and tailored applications across industries and domains that might not otherwise have the resources for proprietary models. 

Perhaps as importantly, the transparency of open source allows for independent scrutiny and auditing of AI systems’ behaviors and ethics. And when we leverage the existing interest and drive of the masses, they will find the problems and mistakes, as they did with the LAION-5B dataset fiasco. 

In that case, the crowd rooted out more than 1,000 URLs containing verified child sexual abuse material hidden in the data that fuels generative AI models like Stable Diffusion and Midjourney — which produce images from text and image prompts and are foundational in many online video-generating tools and apps. 

While this finding caused an uproar, if that dataset had been closed, as with OpenAI’s Sora or Google’s Gemini, the consequences could have been far worse. It’s hard to imagine the backlash that would ensue if AI’s most exciting video creation tools started churning out disturbing content.

Thankfully, the open nature of the LAION-5B dataset empowered the community to motivate its creators to partner with industry watchdogs to find a fix and release Re-LAION-5B, which exemplifies why the transparency of true open-source AI not only benefits users, but also the industry and creators who are working to build trust with consumers and the general public. 

The danger of open sourcery in AI

While source code alone is relatively easy to share, AI systems are far more complicated than software. They rely on system source code, as well as the model parameters, dataset, hyperparameters, training source code, random number generation and software frameworks — and each of these components must work in concert for an AI system to work properly.

Amid concerns around safety in AI, it has become commonplace to state that a release is open or open source. For this to be accurate, however, innovators must share all the pieces of the puzzle so that other players can fully understand, analyze and assess the AI system’s properties to ultimately reproduce, modify and extend its capabilities. 
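
As a concrete illustration, here is a minimal, hypothetical sketch in Python of what such a complete release might enumerate. The field names are illustrative rather than any real standard; an "open weights" release typically fills in only a couple of them.

```python
# Hypothetical manifest of the components a fully open AI release would publish
# so others can reproduce, audit, modify and extend the system. Field names are
# illustrative only, not an established standard.
from dataclasses import dataclass, field

@dataclass
class AISystemRelease:
    inference_code: str                     # source code for running the model
    training_code: str                      # source code used to train it
    model_weights: str                      # the learned parameters
    dataset: str                            # the training data, or a complete account of it
    hyperparameters: dict = field(default_factory=dict)   # learning rate, batch size, etc.
    random_seeds: list = field(default_factory=list)      # seeds behind random number generation
    frameworks: list = field(default_factory=list)        # framework names and versions

def is_fully_open(release: AISystemRelease) -> bool:
    """True only if every core piece of the puzzle is actually shared."""
    return all([release.inference_code, release.training_code,
                release.model_weights, release.dataset])

# An open-weights-style release: weights and inference code are published, but
# the training code and dataset stay closed -- the gap described above.
weights_only = AISystemRelease(inference_code="https://example.com/inference",
                               training_code="",
                               model_weights="https://example.com/weights",
                               dataset="")
print(is_fully_open(weights_only))  # False
```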

Meta, for example, touted Llama 3.1 405B as “the first frontier-level open-source AI model,” but only publicly shared the system’s pre-trained parameters, or weights, and a bit of software. While this allows users to download and use the model at will, key components like the source code and dataset remain closed — which becomes more troubling in the wake of the announcement that Meta will inject AI bot profiles into the ether even as it stops vetting content for accuracy. 

To be fair, what is being shared certainly contributes to the community. Open weight models offer flexibility, accessibility, innovation and a level of transparency. DeepSeek’s decision to open source its weights, release its technical reports for R1 and make it free to use, for example, has enabled the AI community to study and verify its methodology and weave it into their work. 
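
To see the distinction in practice, here is a minimal sketch, assuming the Hugging Face transformers library, of what an open-weight release lets anyone do. The checkpoint ID is illustrative; real repositories may be gated or require approval.

```python
# Minimal sketch: an open-weight release lets you download the published
# parameters and run inference locally. Assumes the Hugging Face `transformers`
# library is installed; the repo ID below is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "meta-llama/Llama-3.1-8B"  # illustrative open-weight checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Open source means", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# What this does NOT give you: the training dataset, the training code or the
# data-filtering decisions behind the weights -- components that remain closed
# and must simply be trusted.
```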

It is misleading, however, to call an AI system open source when no one can actually look at, experiment with and understand each piece of the puzzle that went into creating it.

This misdirection does more than threaten public trust. Instead of empowering everyone in the community to collaborate, build and advance upon models like Llama X, it forces innovators using such AI systems to blindly trust the components that are not shared.

Embracing the challenge before us

As self-driving cars take to the streets in major cities and AI systems assist surgeons in the operating room, we are only at the beginning of letting this technology take the proverbial wheel. The promise is immense, as is the potential for error — which is why we need new measures of what it means to be trustworthy in the world of AI.

Even as Anka Reuel and colleagues at Stanford University have recently attempted to set up a new framework for the benchmarks used to assess how well AI models perform, the evaluation practices the industry and the public rely on are not yet sufficient. Benchmarking fails to account for the fact that the datasets at the core of learning systems are constantly changing and that appropriate metrics vary from use case to use case. The field also still lacks a rich mathematical language to describe the capabilities and limitations of contemporary AI. 

By sharing entire AI systems to enable openness and transparency instead of relying on insufficient reviews and paying lip service to buzzwords, we can foster greater collaboration and cultivate innovation with safe and ethically developed AI. 

While true open-source AI offers a proven framework for achieving these goals, there’s a concerning lack of transparency in the industry. Without bold leadership and cooperation from tech companies to self-govern, this information gap could hurt public trust and acceptance. Embracing openness, transparency and open source is not just a strong business model; it is also a choice between an AI future that benefits everyone and one that benefits only a select few. 

Jason Corso is a professor at the University of Michigan and co-founder of Voxel51.


