Last updated Oct. 29, 2024
TL;DR: An Open Source AI is one made freely available with all necessary code, data and parameters under legal terms approved by the Open Source Initiative. For more details read below.
Point #2 of the Open Source Definition (OSD) says: "The program must include source code [...] The source code must be the preferred form in which a programmer would modify the program [...]". Nobody had a clear answer to what the preferred form to modify an AI system is, so the OSI set out to find one, together with the communities involved, through a co-design process.
The Open Source Definition (OSD) refers to software programs. AI systems, and specifically machine learning systems, are not simply software programs: they blur the boundaries between code and data, configuration options, documentation and new artifacts, like weights and biases. The Open Source AI Definition describes the preferred form for making modifications to an AI system, providing clarity for interpreting the principles of the OSD in the domain of AI.
Open Source means giving anyone the ability to meaningfully fork (study and modify) your system, without requiring additional permissions, to make it more useful for themselves and also for everyone. This is why OSD #2 requires that the source code must be provided in the preferred form for making modifications. This way everyone has the same rights and ability to fork as the original developers, starting a virtuous cycle of innovation.
However, training data does not equate to software source code. Training data is important for studying modern machine learning systems, but it is not what AI researchers and practitioners necessarily use as part of the preferred form for making modifications to a trained model.
The Data Information and Code requirements allow Open Source AI systems to be forked by third-party AI builders downstream using the same information as the original developers. These forks could include removing non-public or non-open data from the training dataset, in order to train a new Open Source AI system on fully public or open data.
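As a concrete illustration, here is a minimal, hypothetical sketch of such a fork: a downstream builder tags each training record with a data class and keeps only records that can be publicly redistributed before retraining. The record schema, the class labels and the sample records are assumptions for illustration, not something prescribed by the Definition.

```python
# Minimal, hypothetical sketch of forking a training corpus: keep only
# records whose provenance allows public redistribution before retraining.
# The Record schema and class labels are illustrative assumptions,
# not prescribed by the Open Source AI Definition.

from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Record:
    text: str
    data_class: str  # assumed labels: "open", "public", "obtainable", "unshareable"

def filter_for_open_fork(records: Iterable[Record]) -> List[Record]:
    """Drop non-public and non-open records, keeping a redistributable corpus."""
    allowed = {"open", "public"}
    return [r for r in records if r.data_class in allowed]

corpus = [
    Record("CC-BY licensed article ...", "open"),
    Record("web page that remains publicly inspectable ...", "public"),
    Record("de-identified medical notes ...", "unshareable"),
]

open_corpus = filter_for_open_fork(corpus)
print(len(open_corpus))  # 2: the unshareable record is removed before retraining
```

A real fork would then re-run the published training and data processing code on the filtered corpus; the sketch only shows the shape of the workflow.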
We want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information, like decisions about their health. Similarly, much of the world's Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.
There are also many cases where the terms of use of publicly available data may give entity A the confidence that it may use the data freely and call it "open data", but not the confidence to give entity B the same guarantees in a different jurisdiction. Meanwhile, entity B may or may not feel confident using that data in its own jurisdiction. An example is so-called public domain data, where the definition of public domain varies country by country. Another example is fair-use or private data, where establishing fair use or compliance with privacy laws may require good knowledge of the law of a given jurisdiction. This resharing is not so much limited as lacking legal certainty.
During our co-design process, the relationship between the weights and the data drove the highest amount of community engagement. In the "System analysis" phase, the volunteer groups suggested that training code and data processing code were more important for modifying the AI system than access to the training and testing data. That result was validated in the "Validation" phase and suggested a path that allows Open Source AI to exist on equal grounds with proprietary systems: both can train on the same kinds of data.
Some people believe that full, unfettered access to all training data (with no distinction as to its kind) is paramount, arguing that anything less would compromise the full reproducibility, transparency and security of AI systems. This approach would relegate Open Source AI to a niche of AI trainable only on open data (see FAQ). That niche would be tiny, even relative to the niche occupied by Open Source in the traditional software ecosystem. The Data Information requirement keeps the same approach as the Open Source Definition, which doesn't mandate full reproducibility and transparency but enables them (e.g. reproducible builds). At the same time, setting Data Information as a baseline doesn't preclude others from formulating and demanding more requirements, just as the Digital Public Goods Standard and the Free System Distribution Guidelines add requirements to the Open Source Definition.
One of the key aspects of OSI's mission is to drive and promote Open Source innovation. The approach OSI takes here enables full user choice with Open Source AI. Users can take the insights derived from the training and data pre-processing code and from the description of unshareable training data, build upon them with their own unshareable data, and share the insights derived from further training with everyone, enabling Open Source AI in areas like healthcare. Alternatively, users can obtain the available and public data listed in the Data Information and retrain the model without any unshareable data, resulting in more data transparency in the resulting AI system. Just like with copyleft and permissive licensing, this approach leaves the choice with the user.
There are four classes of data, based on their legal constraints, all of which can be used to train Open Source AI systems:
- Open training data: data that can be copied, preserved, modified and reshared.
- Public training data: data that others can inspect as long as it remains available.
- Obtainable training data: data that can be obtained, including for a fee.
- Unshareable non-public training data: data that cannot be shared for explainable reasons, like personally identifiable information (PII). For this class of data, a detailed description of the data (what it is, how it was collected, its characteristics) is required so that users can understand the biases and categorization underlying the system.
OSI believes that all these classes of data can be part of the preferred form of making modifications to an AI system. This approach both advances openness in all the components of the AI system and drives more Open Source AI, e.g. in privacy-first areas such as healthcare.
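To make the detailed-description requirement for unshareable data more tangible, here is a hypothetical sketch of the kind of machine-readable Data Information a builder might publish alongside a model. The field names, values and URL are assumptions for illustration; the Definition requires sufficiently detailed information, not any particular schema.

```python
# Hypothetical sketch of machine-readable "Data Information" for an
# unshareable dataset. Field names, values and the URL are illustrative
# assumptions; the OSAID does not prescribe a schema.

import json

data_information = {
    "dataset": "hospital-notes-2023",            # hypothetical dataset name
    "data_class": "unshareable non-public",
    "reason_not_shared": "patient privacy (PII)",
    "description": "De-identified clinical notes from a single hospital network",
    "collection_method": "EHR export, 2019-2023, de-identified before use",
    "size": {"documents": 120_000, "approx_tokens": 95_000_000},
    "known_limitations": ["single geographic region", "English only"],
    "processing_code": "https://example.org/repo/preprocess",  # placeholder URL
}

print(json.dumps(data_information, indent=2))
```

A description like this lets users study the system's likely biases and retrain on substitute data, even when the original records themselves cannot be distributed.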
In legal circles, a Skilled Person means any person having the current knowledge, experience and competence to perform a certain duty. The Wikipedia entry on the skilled person provides more details.
Yes. The Open Source AI Definition makes no distinction between what might be called an AI system, a model, or weights and parameters. Whether the offering is characterized as an AI system, a model, or weights and parameters, the requirements for providing the preferred form for making modifications are the same for it to be called Open Source AI.
AI and software are radically different domains, and drawing comparisons between them is rarely productive. OSD #2 doesn't mandate that Open Source software be built only with compilers released under an OSI-Approved License, because compilers are standardized, either de jure (like ANSI C) or de facto (like TurboPascal or Python). It was generally accepted that, to develop more Open Source software, one could use a proprietary development environment. For machine learning, the training code is not standardized, and therefore it must be part of the preferred form for making modifications in order to preserve the right to fork an AI system.
The Open Source AI Definition does not specifically guide or enforce ethical, trustworthy or responsible AI development practices. However, it does not put up any barriers that would prevent developers from adhering to such principles, if they choose to. Efforts to discuss the responsible development, deployment and use of AI systems, including through appropriate government regulation, are a separate conversation. A good starting point is the OECD's Recommendation of the Council on Artificial Intelligence, Section 1: Principles for responsible stewardship of trustworthy AI.
The Open Source AI Definition does not take any stance about the legal nature of Parameters. They may be free by their nature or a license or other legal instrument may be required to ensure their freedom. We expect this will become clearer over time, once the legal system has had more opportunity to address these issues. In any case, we require an explicit assertion accompanying the distribution of Parameters that assures they're freely available to all.
We used the word "terms" instead of "license" for models because, as mentioned above, we do not yet know what the legal mechanism will be to assure that the models are available to use, study, modify and share. We used "terms" to avoid suggesting that a "license" is the only legal mechanism that could be used. That said, to be approved by the OSI, the terms for parameters must assure the freedoms to use, study, modify and share.
The principles stated in the Open Source AI Definition are generally applicable to any kind of AI, but it's machine learning that challenges the Open Source Definition. For machine learning, a set of artifacts (components) is required to study and modify the system, which calls for a new explanation of what's necessary to exercise those freedoms.
As part of our validation and testing of the OSAID, volunteers checked whether the Definition could be used to evaluate whether AI systems provide the freedoms expected. The models that passed the Validation phase are: Pythia (EleutherAI), OLMo (AI2), Amber and CrystalCoder (LLM360) and T5 (Google). A few others that were analyzed would likely pass if they changed their licenses/legal terms: BLOOM (BigScience), Starcoder2 (BigCode) and Falcon (TII). Those that were analyzed and don't pass, because they lack required components and/or their legal agreements are incompatible with the Open Source principles, are: Llama2 (Meta), Grok (X/Twitter), Phi-2 (Microsoft) and Mixtral (Mistral). These results should be seen as part of the definitional process, a learning moment; they are not certifications of any kind. OSI will continue to validate only legal documents, and will not validate or review individual AI systems, just as it does not validate or review software projects.