# Answers to frequently asked questions
Last updated Sep. 27, 2024
# :warning: This document is still a work in progress :warning:
## What's the difference between the Open Source Definition and the Open Source AI Definition?
FIXME
## What is the role of training data in the Open Source AI Definition?
Open Source means giving anyone the ability to meaningfully fork (study and modify) your system, without requiring additional permissions, to make it more useful for themselves and also for everyone. This is why OSD \#2 requires that the source code must be provided in the preferred form for making modifications. This way everyone has the same rights and ability to fork as the original developers, starting a virtuous cycle of innovation.
Similarities with software end here: training data is important for studying modern machine learning systems, but it is not necessarily what AI builders use as part of the preferred form for making modifications to a trained model.
The Data Information and Code requirements allow Open Source AI systems to be forked by third-party AI builders downstream using the same information as the original developers. These forks could include removing non-public or non-open data from the training dataset, in order to train a new Open Source AI system on fully public or open data.
### Why do you allow the exclusion of some training data?
Data can be hard to share. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information – like decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.
There are also many cases where the terms of use of publicly available data may give entity A the confidence that it may use the data freely and call it "open data", but not the confidence to give entity B guarantees in a different jurisdiction. Meanwhile, entity B may or may not feel confident using that data in its own jurisdiction. One example is so-called public domain data, where the definition of public domain varies country by country. Another is fair-use or private data, where a finding of fair use or compliance with privacy laws may require good knowledge of the law of a given jurisdiction. This resharing is not so much *limited* as [lacking legal certainty](https://opensource.org/blog/copyright-law-makes-a-case-for-requiring-data-information-rather-than-open-datasets-for-open-source-ai).
### How did you arrive at this conclusion? Is it compromising Open Source ideals?
During our co-design process, the relationship between the weights and the data drove the most community engagement. In the [“System analysis” phase](https://discuss.opensource.org/t/report-on-working-group-recommendations/247), the volunteer groups suggested that training code and data processing code were more important for modifying the AI system than access to the training and testing data. That result was validated in the [“Validation phase”](https://discuss.opensource.org/t/initial-report-on-definition-validation/368) and suggested a path that allows Open Source AI to exist on equal grounds with proprietary systems: both can train on the same [kind of data](#What-kind-of-data-should-be-required-in-the-Open-Source-AI-Definition).
Some people believe that full, unfettered access to all training data (with no distinction of its [kind](#What-kind-of-data-should-be-required-in-the-Open-Source-AI-Definition)) is paramount, arguing that anything less would compromise the full reproducibility, transparency and security of AI systems. This approach would relegate Open Source AI to a niche of AI trainable only on open data (see [FAQ](#What-kind-of-data-should-be-required-in-the-Open-Source-AI-Definition)). That niche would be tiny, even relative to the niche occupied by Open Source in the traditional software ecosystem. The Data Information requirement keeps the same approach as the Open Source Definition, which doesn't mandate full reproducibility and transparency but enables them (e.g. [reproducible builds](https://reproducible-builds.org/)). At the same time, setting Data Information as a baseline doesn't preclude others from formulating and demanding stricter requirements, just as the [Digital Public Goods Standard](https://digitalpublicgoods.net/standard/) and the [Free Systems Distribution Guidelines](https://www.gnu.org/distros/free-system-distribution-guidelines.html) add requirements to the Open Source Definition.
One of the key aspects of OSI’s mission is to drive and promote Open Source innovation. The approach OSI takes here enables full user choice with Open Source AI. Users can keep the insights derived from the training and data pre-processing code and the description of unshareable training data, build upon those with their own unshareable data, and give the insights derived from further training back to everyone, allowing for Open Source AI in areas like healthcare. Or users can obtain the available and public data listed in the Data Information and retrain the model without any unshareable data, producing more data transparency in the resulting AI system. Just like with copyleft and permissive licensing, this approach leaves the choice with the user.
### What kind of data should be required in the Open Source AI Definition?
There are four classes of data, based on their legal constraints, all of which can be used to train Open Source AI systems:
* **Open training data**: data that can be copied, preserved, modified and reshared. It provides the best way to enable users to study the system. This must be shared.
* **Public training data**: data that others can inspect as long as it remains available. This also enables users to study the work. However, this data can degrade as links or references are lost or removed from network availability. To mitigate this risk, different communities will have to work together to define standards, procedures, tools and governance models, and Data Information is required in case the data later becomes unavailable. This must be disclosed with full details on where to obtain it.
* **Obtainable training data**: data that can be obtained, including for a fee. This information provides transparency and is similar to a purchasable component in an open hardware system. The Data Information provides a means of understanding this data other than obtaining or purchasing it. This is an area that is likely to change rapidly and will need careful monitoring to protect Open Source AI developers. This must be disclosed with full details on where to obtain it.
* **Unshareable non-public training data**: data that cannot be shared for explainable reasons, like Personally Identifiable Information (PII). For this class of data, the ability to study some of the system's biases demands a detailed description of the data – what it is, how it was collected, its characteristics, and so on – so that users can understand the biases and categorization underlying the system. This must be revealed in detail so that, for example, a hospital can create a dataset with identical structure using their own patient data.
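As a purely illustrative sketch (the dataset names, field names and record structure below are hypothetical and not prescribed by the Open Source AI Definition), the disclosure obligations of the four data classes above might be recorded like this:

```python
# Hypothetical "Data Information" records for the four data classes
# described above. All names and fields are illustrative assumptions.

DATA_INFORMATION = [
    {
        "name": "example-open-corpus",        # hypothetical dataset
        "class": "open",
        "obligation": "share the data itself",
    },
    {
        "name": "example-web-crawl",
        "class": "public",                    # inspectable while available
        "obligation": "disclose full details on where to obtain it",
    },
    {
        "name": "example-licensed-archive",
        "class": "obtainable",                # available, possibly for a fee
        "obligation": "disclose full details on where to obtain it",
    },
    {
        "name": "example-hospital-records",
        "class": "unshareable",               # e.g. contains PII
        "obligation": ("describe in detail what the data is, how it was "
                       "collected, and its characteristics"),
    },
]

def required_disclosure(record: dict) -> str:
    """Summarize the disclosure obligation for one dataset record."""
    return f'{record["name"]} ({record["class"]}): {record["obligation"]}'

for record in DATA_INFORMATION:
    print(required_disclosure(record))
```

The point of the sketch is only that every class carries a disclosure obligation: open data is shared outright, public and obtainable data are disclosed with full details on where to obtain them, and unshareable data is described in enough detail that, for example, a hospital could build an identically structured dataset from its own records.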
OSI believes that all these classes of data can be part of the preferred form for making modifications to the AI system. This approach both advances openness in all the components of the AI system and drives more Open Source AI, e.g. in private-first areas such as healthcare.
## What is a skilled person?
In legal circles, **Skilled Person** means any person having the current knowledge, experience and competence to perform a certain duty. This [Wikipedia entry](https://en.wikipedia.org/wiki/Person_having_ordinary_skill_in_the_art) provides more details.
## Does the Open Source AI Definition cover models, weights and parameters?
Yes. The Open Source AI Definition makes no distinction between what might be called AI system, model, or weights and parameters. To be called Open Source AI, whether the offering is characterized as an AI system, a model, or weights and parameters, the requirements for providing the preferred form for making modifications will be the same.
## Why do you require training code while OSD \#2 doesn’t require compilers?
AI and software are radically different domains, and drawing comparisons between them is rarely productive. OSD \#2 doesn’t mandate that Open Source software be built only with compilers released under an OSI-approved license because compilers are standardized, either de jure (like ANSI C) or de facto (like TurboPascal or Python). It was generally accepted that, to develop more Open Source software, one could use a proprietary development environment.
For machine learning, the training code is not standardized and therefore it must be part of the preferred form of making modifications to preserve the right to fork an AI system.
## Why is there no mention of safety and risk limitations in the Open Source AI Definition?
The Open Source AI Definition does not specifically guide or enforce ethical, trustworthy, or responsible AI development practices. However, it does not put up any barriers that would prevent developers from adhering to such principles, if they choose to. The efforts to discuss the responsible development, deployment and use of AI systems, including through appropriate government regulation, are a separate conversation. A good starting point is the OECD's Recommendation of the Council on Artificial Intelligence, [Section 1: Principles for responsible stewardship of trustworthy AI](https://legalinstruments.oecd.org/en/instruments/oecd-legal-0449).
## Are model parameters copyrightable?
Granting only a copyright license for AI model parameters may not be enough to ensure all the necessary freedoms. There are many opinions about whether model parameters are protected by any rights regime at all and, if they are, by which one. Since it's still not clear whether the parameters are protectable under some other regime (contract, database rights, or perhaps newly created rights), a grant of only a copyright license isn’t going to ensure that the model is as fully available as required by an Open Source software license.
## Why is the "Preferred form to make modifications" limited to machine learning?
The principles stated in the Open Source AI Definition are generally applicable to any kind of AI, but it's machine learning that challenges the Open Source Definition: a machine learning system requires a distinct set of artifacts (components) to study and modify it, which calls for a new explanation of what's necessary to study and modify the system.