The Open Source AI Definition

version 0.0.7.1

:information_source: Note: This document is made of three parts: a preamble, stating the intentions of this document; the Definition of Open Source AI itself; and a checklist to evaluate legal documents.

:information_source: This document follows the definition of AI system adopted by the Organisation for Economic Co-operation and Development (OECD):

An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.

More information about definitions of AI systems is available on OSI's blog.

Preamble

Why we need Open Source Artificial Intelligence (AI)

Open Source has demonstrated that massive benefits accrue to everyone when you remove the barriers to learning, using, sharing and improving software systems. These benefits are the result of using licenses that adhere to the Open Source Definition. The benefits can be summarized as autonomy, transparency, and collaborative improvement.

Everyone needs these benefits in AI. We need essential freedoms to enable users to build and deploy AI systems that are reliable and transparent.

Out of scope issues

The Open Source AI Definition doesn’t say how to develop and deploy an AI system that is ethical, trustworthy or responsible, although it doesn’t prevent it. The efforts to discuss the responsible development, deployment and use of AI systems, including through appropriate government regulation, are a separate conversation.

What is Open Source AI

An Open Source AI is an AI system made available under terms that grant the freedoms to:

Use the system for any purpose and without having to ask for permission.
Study how the system works and inspect its components.
Modify the system for any purpose, including to change its output.
Share the system for others to use with or without modifications, for any purpose.

A precondition to exercising these freedoms is access to the preferred form to make modifications to the system.

Checklist to evaluate machine learning systems

This checklist is based on the paper "The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI", published Mar 21, 2024.

Preferred form to make modifications to machine-learning systems

The default set of components required for a machine-learning Open Source AI is:

Data transparency: Sufficiently detailed information on how the system was trained. This may include the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics; how the data was obtained and selected, the labeling procedures and data cleaning methodologies.
Code: The code used for pre-processing data, the code used for training, validation and testing, supporting libraries such as tokenizers and hyperparameter search code (if used), the inference code, and the model architecture.
Model: The model parameters, including weights. Where applicable, these should include checkpoints from key intermediate stages of training as well as the final optimizer state.

Table of default required components

Required components	Legal frameworks
Code
- Data pre-processing	Available under OSI-compliant license
- Training, validation and testing	Available under OSI-compliant license
- Inference	Available under OSI-compliant license
- Supporting libraries and tools	Available under OSI-compliant license
Model
- Model architecture	Available under OSI-compliant license
- Model parameters (including weights)	Available under terms compatible with Open Source principles
Data transparency
- Training methodologies and techniques	Available under OSI-compliant license
- Training data scope and characteristics	Available under OSI-compliant license
- Training data provenance (including how data was obtained and selected)	Available under OSI-compliant license
- Training data labeling procedures, if used	Available under OSI-compliant license
- Training data cleaning methodology	Available under OSI-compliant license
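
As an illustration only, the table above can be transcribed into a small data structure and used to list which required components a release is missing. This is a minimal sketch, not part of the Definition: the function name `missing_components`, the `release` mapping, and the `not_applicable` parameter are hypothetical, and the check only records whether a publisher asserts that a component is available. It does not evaluate whether the license terms themselves are acceptable.

```python
# Required components, transcribed from the table of default required components.
REQUIRED_COMPONENTS = {
    "Code": [
        "Data pre-processing",
        "Training, validation and testing",
        "Inference",
        "Supporting libraries and tools",
    ],
    "Model": [
        "Model architecture",
        "Model parameters (including weights)",
    ],
    "Data transparency": [
        "Training methodologies and techniques",
        "Training data scope and characteristics",
        "Training data provenance (including how data was obtained and selected)",
        "Training data labeling procedures, if used",
        "Training data cleaning methodology",
    ],
}

# Components the table marks as conditional ("if used").
CONDITIONAL = {"Training data labeling procedures, if used"}


def missing_components(release, not_applicable=frozenset()):
    """Return (category, component) pairs that are required but not provided.

    `release` is a hypothetical mapping of component name -> bool, where True
    means the publisher asserts the component is available under suitable terms.
    `not_applicable` names conditional components that were never used.
    """
    missing = []
    for category, components in REQUIRED_COMPONENTS.items():
        for name in components:
            # A conditional component may be skipped only if declared unused.
            if name in CONDITIONAL and name in not_applicable:
                continue
            if not release.get(name, False):
                missing.append((category, name))
    return missing


# Example: a release that provides everything except inference code.
example = {name: True for names in REQUIRED_COMPONENTS.values() for name in names}
example["Inference"] = False
print(missing_components(example))  # [('Code', 'Inference')]
```

An empty result means every required component is claimed to be present; a human reviewer would still need to confirm the legal terms attached to each one.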

The following components are not required, but their inclusion in releases is appreciated.

Optional components
Code
- Code used to perform inference for benchmark tests
- Evaluation code
Data: All data sets, including:
- Training data sets
- Testing data sets
- Validation data sets
- Benchmarking data sets
- Data cards
- Evaluation metrics and results
- All other data documentation
Model: All model elements, including:
- Model card
- Sample model outputs
Other: Any other documentation or tools produced or used, including:
- Thorough research papers
- Usage documentation
- Technical report
- Supporting tools

Comments

Sam Johnston

2024/04/13 00:16:40

The purpose of DFSG's source code provision ("The program must include source code, and must allow distribution in source code as well as compiled form.") is to enable users to modify behaviour and distribute the results in source (i.e., training data) and "compiled" (i.e., model weights) form. It's one thing to be able to deploy a model for inference (and indeed there's little point in distributing one without permission to use it) and another altogether to have the freedom to change it, for example by transforming, reducing, or expanding the training data. By making training, testing, and validation data sets optional but "appreciated", this freedom is not protected; it's the AI equivalent of freeware distributed without source code. Granted, most models will not meet the definition, but most software is proprietary rather than open source. An example of a model that should meet the definition is one trained on Wikipedia, itself "available under open documentation license" (CC).

Matija Šuklje

2024/04/13 08:10:35

that grant the freedoms to

I find it still important to point out _who_ these freedoms need to be granted to. The deployer or end user? This is an important distinction (esp. after the “made available to the public” got, possibly rightly, removed) because the deployer would typically have more access than the end user, who would typically just see a prompt. A parallel in FOSS might be the deployer hosting the software on their servers and then offering it as SaaS to their end users. So who should be getting the freedoms? Is an Open Source AI also a system where one(!) deployer got all four freedoms from a previous one, but no other deployer exists and the end users do not have these freedoms? (Edited)

2024/04/17 10:46:30

I’m also fine with “everyone”, but that’s still not clear.

CaseyValk

2024/04/18 02:52:24

Consider revising to something like this: Open Source AI is an AI system that is made available under terms that grant, without conditions or restrictions, the rights to: Use... Study.... Modify... Share ...

smaffulli

2024/04/23 14:31:29

@CaseyValk "without conditions or restrictions" wouldn't be acceptable as sometimes there are acceptable conditions in Open Source (think of copyleft licenses)

florihas

2024/04/16 07:14:33

may

Maybe I'm missing something, but why does it say "may" (which can be, and is, interpreted as optional) here, unlike v0.0.6, even though the components are listed as required in the table below?

2024/04/20 10:07:49

I agree. Instead of "may", it should read "must". Or, consider rewriting to something like this (new language in * *): "Data transparency: Sufficiently detailed information on how the system was trained, *including without limitation*, the training methodologies and techniques, the training..."

2024/04/23 16:25:39

this is being addressed in the next draft

2024/04/17 10:10:00

OSI-compliant license

• OSD compliant; or • OSI approved?

2024/04/18 02:30:47

I feel that "OSI-compliant license" aligns with what the OSI site states: "Available under a license that complies with the Open Source Definition". Could we consider spelling it out here, though, to remove any ambiguity? (Edited)

2024/04/23 15:34:05

agreed. For code the next version should say "OSI Approved License". For data information (since that includes mostly documentation) we should say something like OSD compatible license (NOTE: moving this debate to the forum https://discuss.opensource.org/t/how-to-describe-the-acceptable-terms-to-receive-documentation/313/1 )

2024/04/18 02:35:01

Available under terms compatible with Open Source principles

Would it be beneficial to state the principles, or point to them? This seems like an area that could become easily subjective.

2024/04/23 16:25:26

Chris

2024/04/19 10:23:25

but their inclusion in releases is appreciated.

Does this mean best practice or just advocacy?

2024/04/20 10:19:32

I'd like to see this rewritten to something like: "The following components are not required to meet the Open Source AI definition and may be provided for convenience." (Edited)

2024/04/20 13:37:21

Also to consider: if those components are provided, can they be provided under different terms that don't meet the Open Source AI definition, or do they automatically fall under the same OSI-compliant license? (Edited)

Anivar Aravind

2024/04/20 05:58:18

made available

Should the term be replaced with "released" or "distributed" to be more definitive?

2024/04/20 05:59:49

That adds more context to who is granting freedoms and to whom (Edited)

2024/04/20 10:03:10

I feel "made available" is the most appropriate because that's agnostic to form factor for an Open Source AI. The terms "released" and "distributed" imply to me that the only form factor considered for Open Source AI is a distribution / on-prem. For example, consider the triggers for the AGPL and GPL. (Edited)
