
# TTW: Machine Learning Model Licenses

Location: [Guide for Reproducible Research/Licensing/Machine Learning Licenses](https://the-turing-way.netlify.app/reproducible-research/licensing.html)

Like a software license, a Machine Learning (ML) model license governs the use and redistribution of the model and/or algorithm, and the distribution of any derivatives of it. However, there are other components to an AI system, such as data, source code, or applications, which may have their own separate licenses. ML model licenses may restrict the use of the model in specific scenarios where, due to ethics-informed concerns or technical limitations of the model documented in its model card, the licensor is not comfortable with the model being used. ~~or allocate liability associated using the model, its component parts, and its outputs.~~

While many ML models may utilise software licenses (e.g. MIT, Apache 2.0), there are a number of ML-specific licenses, which may be developed for a specific model (e.g. [OPT-175B license](https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md), [BigScience BLOOM RAIL v1.0 License](https://bigscience.huggingface.co/blog/the-bigscience-rail-license)), for a specific company (e.g. [Microsoft Data Use Agreement for Open AI Model Development](https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4Rjfq)), or to be generally applicable to a wide range of models (e.g. [BigScience OpenRAIL-M](https://www.licenses.ai/blog/2022/8/26/bigscience-open-rail-m-license) (Responsible AI License)). In summary, the growing list of ML licenses reflects the understanding that the ML model and other artefacts are distinct from the code, and thus in need of new licensing options.

## Reproduction and propagation of ML models

Many similar or related versions of a model may exist, whether it is the evolution of a model family (e.g. GPT-2 and GPT-3) or implementations of the model in different languages (e.g. DALLE vs. DALLE-pytorch).
Each version may have its own license, though some model developers are now requiring all downstream models to at least have the same base license as the original. In the most extreme cases, a model developer may choose to grant the license exclusively to a single entity or to release every aspect of the model into the public domain. In the case of state-of-the-art deep learning models, the original versions of the models may not be open source (e.g. OpenAI's GPT-3 and DALLE), but versions created by the community may be made available under open licenses on other platforms such as GitHub or Hugging Face.

~~## The Montreal Data License framework~~

~~[TODO: summarise taxonomy]~~

Carlos comment: I think both the Microsoft Data license and the Montreal one can be named here but analyzed under the data licenses section.

## Open & Responsible ML Licenses

[Excerpts from https://huggingface.co/blog/open_rail]

The "open source" approach to collaborative software development has permeated and influenced AI development and licensing practices. It is common practice for ML developers to use open source licenses to release their ML models, because open source licenses have become the standard way of sharing artefacts across the entire ICT space (e.g., software, datasets, models, apps). ML developers might colloquially refer to "open sourcing a model" when they make its weights available by attaching an official open source license, or any other open software or content license such as Creative Commons.

However, open source licenses do not take into account the technical nature and capabilities of the ML model as an artifact distinct from software/source code, and are therefore ill-adapted to enabling a more responsible use of ML models.
In order to balance the principles of open source with a growing demand for responsible ML development, use, and access, a new branch of ML licenses called Responsible AI Licenses (RAIL) emerged in 2019 with the [RAIL Initiative](https://www.licenses.ai/). Research initiatives such as [BigScience](https://bigscience.huggingface.co/) and companies such as [Hugging Face](https://huggingface.co/blog/open_rail) have decided to join efforts and push in this direction along with the RAIL Initiative.

~~## Responsible AI Licenses (RAIL)~~

Responsible AI licenses target specific ethics-informed concerns by enacting use-based restrictions to mitigate potential harms associated with the use of AI-related products and services or component parts such as data, model, code, or applications. Integrating use-based restriction clauses into open AI licenses gives the licensor of the ML model a greater ability to control how AI artifacts are used and the capacity to enforce the license if a misuse of the model is identified, thereby standing up for a responsible use of the released AI artifact.

While RAILs are the first step towards enabling ethics-informed behavioral restrictions, OpenRAILs go a step further and seek to strike a balance between open access and responsible use of the licensed AI artifact. For further information on the implementation of a Responsible AI License, check the material jointly provided by [BigScience and RAIL Initiative](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses).

### Example: OpenRAIL-M

The two main features of an OpenRAIL license are:

- Open: these licenses allow royalty-free access and flexible downstream use and re-distribution of the licensed material, and distribution of any derivatives of it.
- Responsible: OpenRAIL licenses embed a specific set of restrictions for the use of the licensed AI artifact in identified critical scenarios.
Use-based restrictions are informed by an evidence-based approach to ML development and use limitations, which forces a line to be drawn between promoting wide access to and use of ML and the potential social costs stemming from harmful uses of the openly licensed AI artifact. Therefore, while benefiting from open access to the ML model, the user will not be able to use the model for the specified restricted scenarios.

OpenRAILs require downstream adoption of the use-based restrictions by subsequent re-distributions and derivatives of the AI artifact, as a means to dissuade users of derivatives of the AI artifact from misusing it. OpenRAILs are a vehicle towards the consolidation of an informed and respectful culture of sharing AI artifacts, acknowledging their limitations and the values held by the licensors of the model.

In practical terms, every RAIL or OpenRAIL license requires that the set of use-based restrictions included in it must also be included in subsequent re-distributions or derivative versions of the ML model. For instance, the BLOOM RAIL, BigScience OpenRAIL-M, and CreativeML OpenRAIL-M licenses all include the same provisions, 4.a and 5, which require the licensee, when distributing the model or derivatives of it, to include at minimum the same use-based restrictions.

It is important to acknowledge that RAILs and OpenRAILs should not be conceived as instruments which, due to excessive use-based restrictions, could hinder incremental innovation in the AI space. Consequently, as BigScience clarified in its BLOOM RAIL [FAQ](https://bigscience.huggingface.co/blog/the-bigscience-rail-license), the licensor can always, at their own discretion, make an exception and lift some of the restrictions when a licensee justifies that the model has been expressly modified to avoid any concern and/or harm for the specific case at hand.
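
The downstream requirement described in this section (a re-distribution or derivative must carry at minimum the same use-based restrictions as the original) can be pictured as a set-inclusion check. The following is a minimal sketch for illustration only, not a legal tool: the function name `missing_restrictions` and the restriction labels are hypothetical, and the real restrictions are the clauses listed in each license's own attachment.

```python
def missing_restrictions(upstream, downstream):
    """Return the upstream use-based restrictions absent from a downstream license."""
    return set(upstream) - set(downstream)


# Hypothetical labels standing in for the clauses of a license attachment.
upstream = {"no-unlawful-use", "no-medical-advice", "no-disinformation"}

# A derivative may add restrictions of its own, but must keep every upstream one.
compliant = upstream | {"no-biometric-surveillance"}
non_compliant = {"no-unlawful-use"}

print(sorted(missing_restrictions(upstream, compliant)))      # []
print(sorted(missing_restrictions(upstream, non_compliant)))  # ['no-disinformation', 'no-medical-advice']
```

An empty result means the downstream license carries every upstream restriction; anything else lists what was dropped along the way.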
## Examples of ML models and their licenses

The table below showcases several well-known examples of ML models in the fields of NLP, vision, and multimodal generative models. Its aim is to inform the reader about the licensing options chosen by each of the projects, which sometimes differ from one another. Licensing differences might stem from business models, research purposes, or ethics-informed community values. Each license carries the licensor's values and a message from the licensor to potential users.

| Model | Model License | Description | Link to License |
| -------- | -------- | -------- | -------- |
| GPT-2 | MIT License + generated output disclaimer | Permissive open source license | https://github.com/openai/gpt-2/blob/master/LICENSE |
| GPT-3 | Exclusive | Licensed to Microsoft | https://openai.com/blog/openai-licenses-gpt-3-technology-to-microsoft/ |
| YOLO | YOLO License | Public domain license | https://github.com/pjreddie/darknet/blob/master/LICENSE |
| DALLE-pytorch | MIT License | PyTorch implementation of DALLE created by an individual researcher | https://github.com/lucidrains/DALLE-pytorch/blob/main/LICENSE |
| Stable Diffusion | CreativeML Open RAIL-M | Open & Responsible AI License (RAIL) created by Stability.ai and adapted from the BLOOM RAIL license, including use-based restrictions (see attachment A) | https://huggingface.co/spaces/CompVis/stable-diffusion-license |
| OPT | OPT-175B License | Meta restrictive license enabling use of the model weights for research purposes while establishing a set of use-based restrictions, which could be considered a RAIL | https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md |
| BigScience | BigScience OpenRAIL-M | Open & Responsible AI License (RAIL) created by BigScience and adapted from the BLOOM RAIL license, including use-based restrictions (see attachment A) | https://huggingface.co/spaces/bigscience/license |
| Tsinghua University | GLM-130B license | Restrictive license enabling use of the model weights for research purposes | https://github.com/THUDM/GLM-130B/blob/main/MODEL_LICENSE |

## Case Studies: Creating RAIL Licenses

### The case of the BLOOM RAIL v1.0 license

to be completed by Carlos

### Responsible AI Licenses used for research purposes in academic settings (hypothetical example)

Cidney, an ML researcher working for a university lab, has developed an ML vision model for facial recognition as part of her PhD research. She is well aware of both the good quality of the model and its limitations, and is willing to inform users about them in the model card she is planning to release along with the model. Cidney really wants to openly release the model to foster further research in the field, so that other researchers can test it and provide feedback to her, or even come up with improved versions of it. However, she is concerned about potential uses of the model which might lead to undesired outcomes, based on her own view of what the model should not be used for and informed by her research lab's ethical guidelines and code of conduct.

Consequently, she decides to use a Responsible AI License to release her ML model:

1. She wants to place use-based restrictions on specific identified scenarios, informed by her research literature review, her experience, and her awareness of the model's technical limitations. Thus, she comes up with a set of scenarios in which, due to her concerns and the technical capabilities of the model, she does not feel comfortable with the model being used. Eventually, when drafting the restrictions, she will ask the university's legal staff for legal advice.
2. She decides to license just the pre-trained model, not the training dataset nor an app embedding it. The source code is already available under an open source license, so there is no need to license it again.
3. Her research lab allows her to release the model on an open basis for research purposes, thus enabling free access and distribution of the model solely for research and non-commercial purposes.
4. As a result, Cidney will use a RAIL-M license to release her model. Even though the model is accessible on an open basis, the license does not allow for a permissive distribution of it for commercial purposes.

### Open & Responsible AI Licenses used by a startup (hypothetical example)

HealthyML is an ML startup focused on the health sector and developing innovative solutions for market niches such as medicine testing processes, cardiovascular predictive algorithms, and protein folding, among other areas. HealthyML is working on a platform wherein companies will be able to integrate their pipelines and end-consumers will be able to share the data generated by their smart-watches when running. The company is aware of the fierce competition in the market and wants to leverage the network effects of an open platform. Therefore, HealthyML is considering releasing several of its ML models, as well as fine-tuned versions embedded in specific software apps. The goal is to generate traction and foster adoption of the platform in the short term by allowing companies and users to test and experience it. Moreover, they are already in talks with some investors and have agreed to start discussions for their next investment round in 3 months.

HealthyML is well aware of the capabilities of its ML models and apps and, in line with its values as a company striving for ethics-informed research and ML development, it is reluctant to place the technology in the market under an open source license, as users could do whatever they would like with the technology and the company would not be able to control potential misuses. Accordingly, the startup decides to release the ML models and apps with Open & Responsible AI licenses.
On the one hand, they would like the companies with whom they are collaborating and other researchers to access, test, and further improve or build upon their products. On the other hand, they know investors could highly appreciate that the company is taking pioneering steps towards a responsible use and distribution of ML artifacts in the AI space, as a much-needed new trend.

HealthyML decides to use an Open & Responsible AI License for its cardiovascular prediction ML app:

1. It wants to place use-based restrictions on specific identified scenarios, informed by the company researchers' experiments and findings and its awareness of the ML app's capabilities. Thus, HealthyML comes up with a set of scenarios in which, due to its concerns and the technical capabilities of the model, the startup does not feel comfortable with the model being used.
2. HealthyML decides to license a fine-tuned model embedded in a software app, not the training dataset. The use-based restrictions will apply to both the ML model and the app.
3. It seeks to release the model on an open basis, enabling flexible downstream distribution, including for commercial purposes.
4. As a result, HealthyML will use an OpenRAIL-AM license.
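
The two hypothetical cases end with different license names, RAIL-M and OpenRAIL-AM. The sketch below shows how such names could be assembled, assuming the naming convention published by BigScience and the RAIL Initiative (linked earlier in this chapter): an "Open" prefix when the license grants royalty-free access and flexible downstream use, and suffix letters for the licensed artifacts (D for data, A for application, M for model, S for source code). The function `rail_license_name` and the exact letter ordering are illustrative assumptions, not part of any official tooling.

```python
# Suffix letter per licensed artifact. The relative order of A and M follows the
# "OpenRAIL-AM" example in the text; where D and S go is an assumption.
ARTIFACT_LETTERS = {"data": "D", "application": "A", "model": "M", "source": "S"}


def rail_license_name(artifacts, open_access):
    """Build a RAIL-style license name from the set of licensed artifacts."""
    if not artifacts:
        raise ValueError("at least one artifact must be licensed")
    unknown = set(artifacts) - set(ARTIFACT_LETTERS)
    if unknown:
        raise ValueError(f"unknown artifact types: {sorted(unknown)}")
    # Iterate over the mapping so the suffix order is fixed, whatever the input order.
    suffix = "".join(
        letter for name, letter in ARTIFACT_LETTERS.items() if name in artifacts
    )
    prefix = "Open" if open_access else ""
    return f"{prefix}RAIL-{suffix}"


# Cidney: pre-trained model only, research and non-commercial use.
print(rail_license_name({"model"}, open_access=False))  # RAIL-M

# HealthyML: fine-tuned model embedded in an app, commercial redistribution allowed.
print(rail_license_name({"model", "application"}, open_access=True))  # OpenRAIL-AM
```

The name only summarises what is licensed and how openly; the use-based restrictions themselves still have to be drafted, ideally with legal advice as in Cidney's case.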