# The Open Source AI Definition
### version 0.0.8
:::info
:information_source: Note: This document is made of three parts: a preamble, stating the intentions of this document; the Definition of Open Source AI itself; and a checklist to evaluate legal documents.
:::
:::info
:information_source: This document follows the definition of AI system adopted by the [Organisation for Economic Co-operation and Development (OECD)](https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449):
> An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.

More information about definitions of AI systems is available on [OSI's blog](https://blog.opensource.org/open-source-ai-establishing-a-common-ground/).
:::
# Preamble
## Why we need Open Source Artificial Intelligence (AI)
Open Source has demonstrated that massive benefits accrue to everyone when you remove the barriers to learning, using, sharing and improving software systems. These benefits are the result of using licenses that adhere to the Open Source Definition. The benefits can be summarized as autonomy, transparency, frictionless reuse, and collaborative improvement.
Everyone needs these benefits in AI. We need essential freedoms to enable users to build and deploy AI systems that are reliable and transparent.
# What is Open Source AI
An Open Source AI is an AI system made available under terms that grant the freedoms to:
* **Use** the system for any purpose and without having to ask for permission.
* **Study** how the system works and inspect its components.
* **Modify** the system for any purpose, including to change its output.
* **Share** the system for others to use with or without modifications, for any purpose.
A precondition to exercising these freedoms is having access to the preferred form to make modifications to the system.
## Preferred form to make modifications to machine-learning systems
The preferred form of making modifications for a machine-learning Open Source AI must include:
* **Data information**: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.
    * For example, if used, this would include the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics, how the data was obtained and selected, the labeling procedures and data cleaning methodologies.
* **Code**: The source code used to train and run the system.
    * For example, if used, this would include code used for pre-processing data, code used for training, validation and testing, supporting libraries like tokenizers and hyperparameter search code, inference code, and model architecture.
* **Model**: The model parameters.
    * For example, this might include checkpoints from key intermediate stages of training as well as the final optimizer state.
# Checklist to evaluate machine learning systems
:::info
This checklist is based on the paper [The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI](https://arxiv.org/abs/2403.13784) published Mar 21, 2024.
:::
### Table of default required components
| Required components | Legal frameworks |
| ------------------- | ---------------- |
| **Data information** | |
| - Training methodologies and techniques | Available under OSD-compliant license |
| - Training data scope and characteristics | Available under OSD-compliant license |
| - Training data provenance (including how data was obtained and selected) | Available under OSD-compliant license |
| - Training data labeling procedures, if used | Available under OSD-compliant license |
| - Training data cleaning methodology | Available under OSD-compliant license |
| **Code** | |
| - Data pre-processing | Available under OSI-approved license |
| - Training, validation and testing | Available under OSI-approved license |
| - Inference | Available under OSI-approved license |
| - Supporting libraries and tools | Available under OSI-approved license |
| **Model** | |
| - Model architecture | Available under OSI-approved license |
| - Model parameters | Available under OSD-conformant terms |
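
As an illustration only, the table of required components can be read as a lookup from component to legal framework. The sketch below encodes the rows of that table in Python and reports which required rows a release leaves unsatisfied. The names `REQUIRED`, `IF_USED`, and `evaluate`, and the idea of a release declaring a framework per component, are assumptions made for this example and are not part of the Definition. Whether a given license actually is OSD-compliant or OSI-approved remains a legal judgment that this code does not attempt.

```python
from __future__ import annotations

# Legal frameworks as worded in the table of required components.
OSD_COMPLIANT = "OSD-compliant license"
OSI_APPROVED = "OSI-approved license"
OSD_CONFORMANT = "OSD-conformant terms"

# Hypothetical encoding of the table: component -> required legal framework.
REQUIRED = {
    "Training methodologies and techniques": OSD_COMPLIANT,
    "Training data scope and characteristics": OSD_COMPLIANT,
    "Training data provenance": OSD_COMPLIANT,
    "Training data labeling procedures": OSD_COMPLIANT,
    "Training data cleaning methodology": OSD_COMPLIANT,
    "Data pre-processing": OSI_APPROVED,
    "Training, validation and testing": OSI_APPROVED,
    "Inference": OSI_APPROVED,
    "Supporting libraries and tools": OSI_APPROVED,
    "Model architecture": OSI_APPROVED,
    "Model parameters": OSD_CONFORMANT,
}

# Rows the table qualifies with "if used".
IF_USED = {"Training data labeling procedures"}


def evaluate(release: dict[str, str],
             not_used: frozenset[str] = frozenset()) -> list[str]:
    """Return one message per unsatisfied required row (empty list if none).

    `release` maps a component name to the legal framework the release
    declares for it. `not_used` names "if used" components that the
    system did not use, so they are skipped.
    """
    problems = []
    for component, framework in REQUIRED.items():
        if component in IF_USED and component in not_used:
            continue  # the row does not apply to this system
        declared = release.get(component)
        if declared is None:
            problems.append(f"missing: {component}")
        elif declared != framework:
            problems.append(
                f"{component}: expected {framework}, got {declared}"
            )
    return problems


# Usage: a release that provides everything except inference code.
release = dict(REQUIRED)
del release["Inference"]
print(evaluate(release))  # ['missing: Inference']
```

A flat mapping is used here because the table itself is flat: each row pairs one component with one framework, and the three group headings (Data information, Code, Model) carry no requirement of their own.
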

The following components are not required as part of the preferred form of making modifications, but their inclusion in releases is appreciated.

| Optional components | Legal frameworks |
| ------------------- | ---------------- |
| **Data information** All data sets, including: | |
| - Training data sets | Available under OSD-compliant license |
| - Testing data sets | Available under OSD-compliant license |
| - Validation data sets | Available under OSD-compliant license |
| - Benchmarking data sets | Available under OSD-compliant license |
| - Data card | Available under OSD-compliant license |
| - Evaluation data | Available under OSD-compliant license |
| - Evaluation results | Available under OSD-compliant license |
| - Other data documentation | Available under OSD-compliant license |
| **Code** | |
| - Code used to perform inference for benchmark tests | Available under OSI-approved license |
| - Evaluation code | Available under OSI-approved license |
| **Model** All model elements, including: | |
| - Model card | Available under OSD-compliant license |
| - Sample model outputs | Available under OSD-compliant license |
| - Model metadata | Available under OSD-compliant license |
| **Other** Any other documentation or tools produced or used, including: | |
| - Research papers | Available under OSD-compliant license |
| - Technical report | Available under OSD-compliant license |