Can AI even be open source? It’s complicated
Without open source, there is no artificial intelligence (AI). Period. End of statement. It’s not just that AI’s early roots spring from the 1960s’ open language Lisp; the headline generative AI models, such as ChatGPT, Llama 2, and DALL-E, are built on solid, open-source foundations. However, those models and programs themselves are not open source.

Oh, I know that when Meta CEO Mark Zuckerberg unveiled Llama 3.1 in a Threads post, he said, “Open-source AI is the path forward,” and that Meta is “taking the next steps towards open-source AI becoming the industry standard.” At a SIGGRAPH keynote discussion with Nvidia CEO Jensen Huang, Zuckerberg admitted:

“We’re not pursuing [open source] out of altruism, though I believe it will benefit the ecosystem. We’re doing it because we think it will enhance our offerings by creating a strong ecosystem. … this might sound selfish, but after building this company for a while, one of my goals for the next 10 or 15 years is to ensure we can build the fundamental technology for our social experiences.”

Zuckerberg is sincere about open source. As we’ve seen repeatedly, open source is the way to unite technologies. For example, we use a unified Linux now instead of multiple, incompatible versions of Unix because Linus Torvalds open-sourced Linux under GPLv2.

But I’ve also read Meta’s Llama 2 license and the Llama Acceptable Use Policy. It’s not open source. It’s not even close.

Zuck’s not alone, though, in playing fast and loose with open source. From the name, you’d think OpenAI is open source. It was indeed open back when GPT-1 and GPT-2 were state-of-the-art. That was a long time, and billions in revenue, ago.
Starting with GPT-3, OpenAI closed its doors.

As Mark Dingemanse, a language scientist at Radboud University in Nijmegen, Netherlands, said in a Nature article, some big firms are reaping the benefits of claiming to have open-source models while trying “to get away with disclosing as little as possible.”

Indeed, Dingemanse and his colleague Andreas Liesenfeld found only one AI chatbot that could truly be described as open: the Hugging Face-hosted large language model (LLM) BigScience/BloomZ. Other LLMs that qualify are Falcon, FastChat-T5, and OpenLLaMA. But most LLMs contain proprietary, copyrighted, or simply unknown information their owners won’t tell you about. As the Electronic Frontier Foundation (EFF) observed, “Garbage In, Gospel Out.”

Now, much of the innovative software driving AI is open source. TensorFlow, a versatile machine learning framework that supports multiple programming languages, and PyTorch, popular for its dynamic computational graphs and ease of use in deep learning applications, are two examples that quickly come to mind.

The LLMs and programs built on them are another story. All the most popular AI chatbots and programs are proprietary.

So, why are companies claiming their projects are open source? By “open-washing” their efforts, businesses hope to gild their programs with open source’s positive connotations of transparency, collaboration, and innovation. They also hope to con developers into helping advance their own projects. It’s all about marketing.

Clearly, we need to devise an open-source definition that fits AI programs to stop these faux-source efforts in their tracks. Unfortunately, that’s easier said than done.
While people constantly fuss over the finer details of what’s open-source code and what isn’t, the Open Source Initiative (OSI) has had the definition, the Open Source Definition (OSD), nailed down for almost twenty years. The convergence of open source and AI is much more complicated.

In fact, Joseph Jacks, founder of the venture capital (VC) firm OSS Capital, argued there is “no such thing as open-source AI” since “open source was invented explicitly for software source code.”

It’s true. In addition, open source’s legal foundation is copyright law. As Jacks observed, “Neural Net Weights (NNWs) [which are essential in AI] are not software source code; they are unreadable by humans, nor are they debuggable.”

As Stefano Maffulli, OSI executive director, has told me, software and data are mixed in AI, and existing open-source licenses are breaking down. Specifically, trouble emerges when all that data and code are merged in AI/ML artifacts, such as datasets, models, and weights. “Therefore, we need to make a new definition for open-source AI,” said Maffulli.

However, getting there hasn’t been easy. The main point of contention is the extent of openness required, particularly regarding training data. While some argue that releasing pre-trained models without the training data is sufficient, others argue that true open-source AI should also include access to the training data.
As julia ferraioli (she spells her name in all lowercase), Amazon Web Services (AWS) Open Source AI/ML Strategist, observed in a blog post, with the current OSI open-source AI definition 0.0.8 draft, “the only aspects of the data that a system desiring to be labeled as ‘open source AI’ would need to publish are: training methodologies and techniques; training data scope and characteristics; training data provenance (including how data was obtained and selected); training data labeling procedures; and training data cleaning methodology.”

None of that, ferraioli continued, “gives the prospective adopter of the AI system insight into the data that was used to train the system.” Without this data, can an AI be open? ferraioli argues it can’t.

She’s not the only one who holds that position. She quotes her colleague, AWS Principal Open Source Technical Strategist Tom Callaway, who wrote, “Without requiring the data be open, it is not possible for anyone without the data to fully study or modify the LLM, or distribute all …