The Open Source AI Definition

version 0.0.9

This document follows the definition of AI system adopted by the Organisation for Economic Co-operation and Development (OECD):

An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.

More information about definitions of AI systems is available on OSI's blog.

Preamble

Why we need Open Source Artificial Intelligence (AI)

Open Source has demonstrated that massive benefits accrue to everyone after removing the barriers to learning, using, sharing and improving software systems. These benefits are the result of using licenses that adhere to the Open Source Definition. For AI, society needs the same essential freedoms of Open Source to enable AI developers, deployers and end users to enjoy those same benefits: autonomy, transparency, frictionless reuse and collaborative improvement.

What is Open Source AI

When we refer to a "system," we mean both a fully functional structure and its discrete structural elements. The requirements for being considered Open Source are the same, whether applied to a system, a model, weights and parameters, or other structural elements.

An Open Source AI is an AI system made available under terms and in a way that grant the freedoms^[1] to:

Use the system for any purpose and without having to ask for permission.
Study how the system works and inspect its components.
Modify the system for any purpose, including to change its output.
Share the system for others to use with or without modifications, for any purpose.

These freedoms apply both to a fully functional system and to discrete elements of a system. A precondition to exercising these freedoms is to have access to the preferred form to make modifications to the system.

Preferred form to make modifications to machine-learning systems

The preferred form of making modifications to a machine-learning system is:

Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.
- For example, if used, this would include the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics, how the data was obtained and selected, the labeling procedures and data cleaning methodologies.
Code: The source code used to train and run the system, made available with OSI-approved licenses.
- For example, if used, this would include code used for pre-processing data, code used for training, validation and testing, supporting libraries like tokenizers and hyperparameter search code, inference code, and model architecture.
Weights: The model weights and parameters, made available under OSI-approved terms^[2].
- For example, this might include checkpoints from key intermediate stages of training as well as the final optimizer state.
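
As an illustration only, the three components listed above can be treated as a release checklist. The sketch below is hypothetical: the `Component` record, its fields and the `unmet_requirements` function are assumptions made for this example and are not part of the definition. Whether a component's terms are acceptable is assumed to be a reviewer's judgment recorded by hand, not something the code determines.

```python
from dataclasses import dataclass

# The three component names come from the list above; everything else here
# (class, fields, function) is a hypothetical illustration.
REQUIRED = ("data information", "code", "weights")


@dataclass
class Component:
    name: str
    available: bool         # has the component been published?
    terms_acceptable: bool  # reviewer's judgment, recorded by hand


def unmet_requirements(components):
    """Return one message per required component that is missing or not acceptably licensed."""
    by_name = {c.name: c for c in components}
    problems = []
    for name in REQUIRED:
        c = by_name.get(name)
        if c is None or not c.available:
            problems.append(f"{name}: not available")
        elif not c.terms_acceptable:
            problems.append(f"{name}: terms not acceptable")
    return problems


release = [
    Component("data information", available=True, terms_acceptable=True),
    Component("code", available=True, terms_acceptable=True),
    Component("weights", available=True, terms_acceptable=False),
]
print(unmet_requirements(release))  # ['weights: terms not acceptable']
```

A release with an empty result would have all three components present under terms the reviewer accepted; the sketch says nothing about whether the data information is "sufficiently detailed," which remains a human assessment.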

Open Source models and Open Source weights

For machine learning systems,

An AI model consists of the model architecture, model parameters (including weights) and inference code for running the model.
AI weights are the set of learned parameters that overlay the model architecture to produce an output from a given input.

The preferred form to make modifications to machine learning systems also applies to these individual components. “Open Source models” and “Open Source weights” must include the data information and code used to derive those parameters.
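
The distinction between architecture, weights and inference code can be shown with a toy example. This is a minimal sketch under stated assumptions: the linear form, the function names and the numeric values are invented for illustration and do not come from the definition.

```python
def architecture(weights, x):
    """Model architecture: a fixed linear form y = w * x + b (chosen only for illustration)."""
    w, b = weights
    return w * x + b


def predict(weights, x):
    """Inference code: runs the architecture with a given set of weights."""
    return architecture(weights, x)


# Two sets of weights laid over the same architecture give different outputs
# for the same input.
weights_a = (2.0, 1.0)
weights_b = (-1.0, 0.5)

print(predict(weights_a, 3.0))  # 7.0
print(predict(weights_b, 3.0))  # -2.5
```

The architecture and inference code are ordinary source code, while the weights are data produced by training; this is why the text above asks for the data information and training code alongside the parameters themselves.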

[1] These freedoms are derived from the Free Software Definition.
[2] The Open Source AI Definition does not take any stance as to whether model parameters require a license, or any other legal instruments, and whether they can be legally controlled by any such instruments once disclosed and shared.

Joshua Gay

2024/08/23 16:14:15

The term "AI systems" is preferred over "machine-learning systems" to encompass the full spectrum of artificial intelligence technologies. While machine learning is a critical component of AI, not all AI systems rely on machine learning techniques. AI includes a broader array of methodologies, such as rule-based systems, expert systems, optimization algorithms, and more. By using "AI systems," the document ensures inclusivity and relevance across all AI technologies, rather than limiting its scope to machine learning-based approaches.

Ignatius Ezeani

2024/09/03 17:39:13

I agree with this view. Thanks Joshua!

smaffulli

2024/09/04 09:44:26

@joshua: your comment is not connected to a specific text, so it's not clear to me what you're referring to. Are you referring to "Preferred form to make modifications" being tied to ML? The reason for that is that the other AI systems you mention are software, so the "traditional" OSD applies. Can you please clarify?

2024/08/23 16:31:21

Don't narrow it to machine-learning systems, but even if you do, it is really only in NN systems that weights matter this much. But as you have already stated, weights are a kind of parameter, so just stick with parameters as your focus. Here is a suggested rewrite: Parameters: The model parameters, such as weights or other configuration settings, made available under OSI-approved terms[2]. For example, this might include checkpoints from key intermediate stages of training in neural networks, decision boundaries in support vector machines, tree structures in decision trees, or the final optimizer state in various algorithms.

2024/09/04 09:48:31

Your comment is not tied to the text but I can see where your suggestion should go.

2024/08/23 16:46:44

I don't really understand this section. I suggest broader language that covers more AI systems should be used over narrower terminology.

-Open Source models and Open Source weights
+Open Source models and Open Source parameters

-For machine learning systems,
+For AI systems,

-An AI model consists of the model architecture, model parameters (including weights) and inference code for running the model.
+An AI model consists of the model architecture or algorithm, model parameters (including weights, decision boundaries, tree structures, etc.), and inference code for running the model.

-AI weights are the set of learned parameters that overlay the model architecture to produce an output from a given input.
+AI parameters, such as weights or decision boundaries, are the set of configuration settings that overlay the model architecture or algorithm to produce an output from a given input.

zack

2024/08/28 11:34:30

the system.

(editorial) better: "any part of the system".

2024/09/03 17:43:18

Machine Learning systems can provide executable payloads stored in the model. This creates a loophole that allows the traditional "binary blob" problem to happen in a new way. Suggested text to explicitly close the loophole: "The use of pre-compiled binaries or any other non-modifiable elements stored in the model and that are then loaded and executed by the AI system is not permitted under the Open Source AI definition. This prohibition includes, but is not limited to, runtime-loaded binaries, device drivers, or other ancillary files that are necessary for the system's operation. The intent is to ensure that no traditional software components can be superficially reclassified as AI Systems to circumvent open-source principles. AI systems that rely on binary blobs, whether as part of their data processing pipeline, model training, or runtime execution, cannot be labeled as Open Source AI unless all such components are provided under terms that conform with the Open Source Definition."

2024/09/04 09:57:27

Should this be in a general purpose definition or should this rather be incorporated in a checklist and operations manual, like the DFSG?

2024/09/04 22:14:22

Perhaps in both. The core principle (no binary blobs or proprietary runtime-loaded components) should be included in the general definition to make it a central tenet of Open Source AI. By embedding the principle directly into the definition, you ensure that every user and developer immediately understands that this practice is not allowed in any open-source AI context. This can be accompanied by more detailed operational guidance in a checklist or manual that provides specific examples and elaborates on what practices are prohibited, including how binary blobs might be hidden in AI systems. This approach would help ensure that the principle is clear and immutable in the definition while giving practical implementation details in the operational documentation.
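
The concern in this thread about executable payloads stored in a model can be illustrated with Python's `pickle` format, which is one well-known way a serialized object can carry instructions that run when the file is loaded. This is a swapped-in illustration, not the pre-compiled binary case the suggested text describes, and the `risky_opcodes` helper, the opcode list and the sample data are assumptions made for this sketch. It inspects the stream with the standard library's `pickletools.genops` and never loads it.

```python
import pickle
import pickletools

# Opcodes that let a pickle stream import or call objects at load time.
# This set is an assumption for the sketch, not an exhaustive audit list.
RISKY_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}


def risky_opcodes(data: bytes):
    """Return (opcode name, argument) pairs that can trigger imports or calls on load."""
    return [
        (op.name, arg)
        for op, arg, _pos in pickletools.genops(data)
        if op.name in RISKY_OPCODES
    ]


# Plain numeric "weights": only containers and floats, nothing callable.
plain_weights = pickle.dumps({"w": [0.1, 0.2], "b": 0.3})


class Payload:
    """Hypothetical object whose unpickling would call print()."""

    def __reduce__(self):
        return (print, ("executed during load",))


# Serializing does not run the payload; only loading the stream would.
with_payload = pickle.dumps({"w": [0.1], "hook": Payload()})

print(risky_opcodes(plain_weights))                     # []
print([name for name, _ in risky_opcodes(with_payload)])
```

A scan like this only flags streams that can call code; it does not show whether the called code is harmful, which is part of why the thread suggests pairing a principle in the definition with more detailed operational guidance.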

Yann Lechelle

2024/09/17 08:26:24

models

Capitalize

2024/09/17 08:26:33

weights