TTW + HuggingFace/BigScience

Proposed Work

  1. Guide for Ethical Research/Ethics-Informed Licensing
  • Eirini & Jen
  2. Guide for Reproducible Research/Licensing/Machine Learning Licenses
  • Jen & Carlos
  3. Guide for Reproducible Research/Open Research/Open AI
  • Jen & Carlos

Meeting Notes

7 Sept Collab Cafe

  • Outcome: create an issue to document one of the chapters
  • Pomodoro 1
    • Abhishek Gupta (Montreal Ethics AI): social perspective licensing (e.g. Open & Responsible AI Licensing)
      • Open licenses a way to communicate to the public on their intentions about the work
      • Not necessarily for the purpose of litigation, but to set community norms and embedding community values
      • Open RAIL had an organic birth because of BigScience
        • First thought was Apache 2.0, but realised we can't fully open this in the classic OSI sense
        • Striking a balance between open & responsible innovation
        • Enabling open access, use, reuse, while also placing responsible use restrictions in specific critical scenarios (shaped by technical limitations of the model & ethical charter)
    • A new thing because of the org (in contrast to fully freedom-zero vs. commercial licenses) and also because of the thing being licensed (AI, which is more than code)
    • Anne feedback on audiences:
      - Wikipedia-like resource to look up a DS/tech term
      - Domain-specific practitioners (e.g. health)
      - Wider research infrastructure world
      • Case study could also be a good format
      • Emerging licensing schemas (building the railroad as you ride it)
        • BLOOM License > Open RAIL
      • Maybe something more general about licensing?
      • What's the point of licensing if you can't enforce it? What is the purpose of a license? FAQ/Key questions
      • Future blog posts from BigScience on enforcement
    • Context/introduction, licensing, challenges ahead
      • Link to FAQ for more context
  • Pomodoro 2
    • Introduction can provide the context of why ML licenses exist in addition to just software/data
    • Stick with the existing structure + add case study
    • Derived model section
      • Open derivatives of AI are a big phenomenon
        • After release you have a TON of versions
        • This is where the license plays an important role -> why BigScience used a copyleft approach (at minimum the same restrictions as the first one)
    • ML model table w/ licenses (see the license-lookup sketch after this list)
      • Add paragraph explaining the role of this table
        • A way to communicate culture / community values
      • There are a lot of models but not a lot of LLMs
        • NLP: GPT-2 (MIT), GPT-3, Chinese model (Apache), BLOOM, OPT-175B, BERT (Apache 2.0)
        • CV: OpenCV, YOLO, DALL-E, SEER
    • Case study: Creating Open RAIL
  • Jen to update HackMD; Carlos edits next Wed and we chat next Fri
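
A hedged sketch of how the ML-model/license table discussed above could be populated programmatically. It uses the huggingface_hub client to read license tags from model repos on the Hugging Face Hub; the model IDs and the "license:<id>" tag convention are assumptions to verify against the actual model pages.

```python
# Sketch: look up license metadata for a few Hub-hosted models.
# Model IDs and the "license:<id>" tag convention are assumptions;
# check the model pages before citing the results.
from huggingface_hub import HfApi

api = HfApi()

for model_id in ["gpt2", "bert-base-uncased", "bigscience/bloom"]:
    info = api.model_info(model_id)
    # Licenses are usually exposed as "license:<identifier>" tags on the repo.
    licenses = [tag.split(":", 1)[1] for tag in info.tags if tag.startswith("license:")]
    print(model_id, licenses or "no license tag found")
```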

MozFest Presentation

26 Aug Meeting

  • Capture guidance on Responsible/Ethical licenses for AI
  • Ethics-informed licenses (ethics is an outcome due to the context/resources)
    • Intro to what it means for a license to be ethical
  • Open AI chapter: Policy battle on who defines/owns the term "openness"
    • BigScience approach to open: Royalty free use/reuse; flexible downstream redistribution
      • Next version will have to embed the previous/genesis version's license
    • Meta: open access to the artefact; gated API access to control who is using it
  • What other dimensions of AI should be open?
    • Multidimensional approach (more than just access)
    • Process, data, governance, policy
    • Network effects, ethics, better together, policy

Research Notes

  • https://techcrunch.com/2022/09/06/the-eus-ai-act-could-have-a-chilling-effect-on-open-source-efforts-experts-warn/
    • Brookings: the AI Act would burden open source developers with adhering to guidelines for risk management, data governance, technical documentation and transparency, as well as standards of accuracy and cybersecurity
    • “This could further concentrate power over the future of AI in large technology companies and prevent research that is critical to the public’s understanding of AI,” Alex Engler
    • The legislation contains carve-outs for some categories of open source AI, like those exclusively used for research
    • Stable Diffusion (Open-RAIL M) being used to create porn deepfakes
    • HuggingFace: AI Act as proposed is too vague - is it for "pre-trained" models or the software?
      • Responsible licensing is starting to become a common practice for major AI releases
      • Open innovation and responsible innovation in the AI realm are not mutually exclusive ends, but rather complementary ones
      • The intersection between both should be a core target for ongoing regulatory efforts, as it is being right now for the AI community

https://gist.github.com/nicolasdao/a7adda51f2f185e8d2700e1573d8a633

  • An inspiration for an easy, readable explanation of licenses and how to choose the right one for you

https://huggingface.co/blog/open_rail

  • Open RAIL enables open access, use, and distribution of AI artifacts while also requiring responsible use
  • "Open sourcing a model without taking due account of its impact, use, and documentation could be a source of concern in light of new AI regulatory trends. Henceforth, OpenRAILs should be conceived as instruments articulating with ongoing AI regulatory trends and part of a broader system of AI governance tools, and not as the only solution enabling open and responsible use of AI."
  • "Open licensing is one of the cornerstones of AI innovation. Licenses as social and legal institutions should be well taken care of. They should not be conceived as burdensome legal technical mechanisms, but rather as a communication instrument among AI communities bringing stakeholders together by sharing common messages on how the licensed artifact can be used."

https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses

  • RAIL (licenses.ai) established in 2019 for the adoption of behavioural use-based restrictions in licenses and contracts to mitigate the risk of harm from sharing AI tech
  • BigScience RAIL License from BLOOM
    • Use-based restrictions on the LLM and its derivatives (but not on the source code)
    • Opened up questions around (i) nature of artefacts being licensed (data, source, model, binaries/executables) (ii) what constitutes derivative works for each (iii) whether artefacts license enables permissive downstream distribution and derivative versions
  • RAIL licenses require 1) behavioural-use restrictions to disallow/restrict certain applications 2) require downstream use/re-distribution to include the same restrictions (at min)
    • RAIL-D: use restrictions applied only to data
    • RAIL-A: use restrictions applied only to the application/executable
    • RAIL-M: use restrictions applied only to the model
    • RAIL-S: use restrictions applied only to source code
    • RAIL-DAMS: a RAIL license with restrictions applied to all of the above
  • Open RAIL: clarifies that the licensor offers the licensed artefact at no charge and allows licensees to relicense the artefact and derivative works as they choose, as long as the Use Restrictions apply
  • Open RAIL is an attempt to provide practitioners more control over how what they create is used while also creating a mechanism to license broadly and permissively

https://twitter.com/carlos_mferr/status/1563081644302426113?s=21&t=45Bbu6jTqtow9kioqUGb1Q

  • Licensing inspired by open source movements and led by evidence-based approaches from ML
  • BigScience OpenRAIL-M is adapted from BLOOM RAIL
  • Offer the AI community an open & responsible AI license for more models, another instrument to support the ecosystem of AI Governance tools (model cards, eval benchmarks, ethical charters, CoC)

https://bigscience.huggingface.co/blog/the-bigscience-rail-license

  • BigScience wants to ensure free worldwide access to LLMs through a multicultural and responsible approach to the dev and release of these artefacts
  • Balance maximising access and mitigating risks associated with these models
  • The fact that a software license is deemed "open" (e.g. under an "open source" license) does not inherently mean that the use of the licensed material is going to be responsible
  • Responsible AI License concept emerged from a community initiative to empower devs to place restrictions on the use of their AI tech through license agreements
    • Balance fostering innovation and protecting public interest and fundamental rights
    • EU AI Act includes both hard and soft law (high-risk AI vs. CoC) and new AI regulatory sandboxes
    • Solutions should take a multi-dimensional approach where technical tools are complemented by education/training, governance frameworks, etc.
  • Apache 2.0 applies to the resources used to develop the Model; a new license was drafted for access to/distribution of the Model to promote responsible use
    • Open and permissive license with use-based restrictions
  • License covers models, training checkpoints, source code/documentation (Complementary Material)
  • Not an open source license according to the Open Source Initiative because it has use-based restrictions
    • But no restrictions on reuse, distribution, commercialisation, or adaptation
    • Use-based restrictions only apply to model and complementary materials not the rest, e.g. source code
  • We are conscious that the concept of "harm" is not as straightforward, even more so from a legal perspective. We have drafted our use case restrictions informed by the opinion of technical experts, experimental and empirical results on AI fairness evaluation, and ongoing legislative proposals

Open Sourcing AI: Intellectual Property at the Service of Platform Leadership
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4018413

  • Tech companies adopt IP strategies to protect their investment in the field (e.g. copyright, patents, trade secrets)
  • Patenting is on the rise, but so are open source projects by the same actors
  • Strategic reasons behind adoption of open source licensing in AI
  • “Openness” as a competitive factor to build user base and ecosystem

OSI (https://opensource.org/osd) vs Creative Commons (https://creativecommons.org/about/cclicenses/)

  • OSI: "open source doesn't just mean access to the source code"
    • Free redistribution, source code, derived works, no discrimination against persons/groups/fields, not specific to a product, not restrict other software, technology-neutral (2007)
  • CC: licenses give individuals and orgs a standardized way to grant public permission to use creative work under copyright law
    • "What can I do with this work?"
    • CC BY: distribute, remix, adapt, build upon but must attribute (by) to creator
    • CC BY-SA: same as CC-BY but also must license modified material under same terms
    • CC BY-NC: same as CC-BY but only noncommercial uses of the work
    • CC BY-ND: copy and distribute (including commercially) with attribution, but no derivatives or adaptations
    • CC0: public dedication tool; creator gives up copyright and puts work in the public domain; no conditions

https://creativecommons.org/2021/03/04/should-cc-licensed-content-be-used-to-train-ai-it-depends/

  • CC supports broad access to content in the public interest; greater openness for the common good
    • Broad access can help reduce bias, enhance inclusion, promote activities like education and research, foster beneficial innovation
  • Currently no consensus on whether the use of copyrighted works to train AI is an exercise of an exclusive right (e.g. reproduction, adaptation)
    • In the US, it's likely considered fair use (permits limited use of copyrighted material without having to first acquire permission from the copyright holder; e.g. in the public interest)
    • 2019: IBM used CC-licensed photos of faces to train facial recognition
  • United Nations Secretary-General António Guterres acknowledged that “advances in artificial intelligence-related technologies, such as facial recognition software and digital identification, must not be used to erode human rights, deepen inequality or exacerbate existing discrimination.”
  • An inclusive approach to support better sharing
    • Currently the AI environment is defined by ethical concerns, lack of algorithmic transparency, and privatization and enclosure of outputs
    • To promote positive use of CC content, need a community-led, coordinated and inclusive approach to consider issues of accountability, responsibility, sustainability, human/cultural/personality/privacy rights, data protection

https://arxiv.org/abs/1903.12262 (Montreal Data License)

  • Taxonomy for data licensing in AI/ML

https://stackoverflow.blog/2022/08/08/can-you-stop-your-open-source-project-from-being-used-for-evil/

  • Free Software Foundation: “struggle against for-profit corporate control” and against restrictions on users
    • Argues that licenses must not prohibit uses such as torture, as such a restriction is not enforceable
  • Open source can also be good for corporations for pragmatic/business reasons
  • Open Source Initiative: licenses “may not discriminate against persons or groups. Giving everyone freedom means giving evil people freedom too”
  • Ethical source not open source?
    • Ethical source movement - licenses to give developers “freedom and agency to ensure our work is being used for social good and in service of human rights”
    • Emphasise developer rights to have a say in what their labor is used for over the rights of a user to use it for anything
    • Examples: prohibit companies that violate labor laws or human rights or extract fossil fuels
    • Coraline Ada Ehmke: in traditional open source, success is generally measured by the number of adoptions, especially those of large tech companies
      • Ethical source is more concerned about real-world impact of the technology; downstream ethical nature of uses the software enables and how these affect real people
  • An ethical source license (ESL) doesn’t necessarily stop anonymous users of deepfake software, but it could prevent corporate misuse (corporate lawyers also actually care about/can be audited for licensing misuse)
  • In cybersecurity: focus on the most harmful and most likely to be exploited vulnerabilities first
    • An ESL maybe can’t stop all harm, but it can make some harms less likely, less convenient, or more costly

https://stability.ai/blog/stable-diffusion-public-release

  • Collab with HuggingFace and CoreWeave
  • CreativeML OpenRAIL-M permissive license that allows for commercial and non-commercial use
  • Ethical and legal use is your responsibility, and the license must accompany model distribution
  • Safety classifier included by default in the package that removes outputs that may be undesired by the user (params can be adjusted; input is welcomed)
  • Tools: model card, dev notebook, public demo space, DreamStudio beta, Discord
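
A minimal sketch (not the official release code) of what the notes above describe: loading the OpenRAIL-M-licensed Stable Diffusion weights through the Hugging Face diffusers pipeline, which bundles the safety classifier by default. The model ID and output fields are assumptions; verify against the actual release and its model card.

```python
# Sketch: Stable Diffusion via diffusers, with the default safety checker.
# Accepting the model license on the Hub (and authenticating) may be
# required before the weights can be downloaded.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # assumed model ID
    torch_dtype=torch.float16,
).to("cuda")

result = pipe("a watercolour painting of a lighthouse")
# The bundled safety classifier flags images it considers unsafe; flagged
# outputs are returned blanked out rather than silently dropped.
print(result.nsfw_content_detected)
result.images[0].save("lighthouse.png")
```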

https://privacyprotection.substack.com/p/introduction-our-quest-to-make-a

  • Global Privacy Protection standard
  • AI/ML opened up opportunities to contextualize the web and every piece of data in an effort to surface understanding and insights in a more efficient manner
  • The fallout of this opportunity has seen the emergence of industries that have profited from the use of personal information: data brokers, advertising platforms, etc.
  • BigScience PII working group: clean PII from data and mitigate PII leakage - https://bigscience.notion.site/Privacy-a54a45b5865e49769259de01712e59ef
  • Principles
    • Anonymity/Privacy
    • Autonomy
      • Consent: informed consent about PII in data
      • Contestation: request data removal or anonymization
    • Transparency
    • Inclusion/Representation
  • Open Privacy Framework - language-independent specification and set of recommendations to minimize risks of PII exposure
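
An illustrative sketch only, not the BigScience PII pipeline linked above: it shows the kind of regex-based redaction step a PII-minimisation framework might apply before text is used for training. The patterns and placeholder tokens are assumptions for demonstration.

```python
# Sketch: replace common PII patterns with typed placeholder tokens.
import re

# IP addresses are matched before phone numbers so they are not
# swallowed by the looser phone pattern.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(text: str) -> str:
    """Replace matched PII spans with a placeholder naming the PII type."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact jane.doe@example.org or +44 20 7946 0958 from 192.168.0.1"))
```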

https://thegradient.pub/machine-learning-ethics-and-open-source-licensing/ (Part 1)

  • Why should ML be considered separate from software? Think about how it is used and misused
    • "the abuse of machine learning tools is serious, widespread, and systemic, not simply isolated incidents in systems that otherwise work. Even compared with other types of software (operating systems, desktop GUIs, web browsers), machine learning demonstrates a unique capacity to be systematically used in ways that can and are being used to harm individuals, vulnerable groups, and society as a whole"
  • In some situations, market action might resolve the problem without legal or regulatory intervention
    • Companies rolling back or pausing their work
    • This is, oddly enough, one of the benefits of keeping certain code under a proprietary umbrella, whether through platform access or a proprietary software license that places limitations on who is allowed to use it and how.
  • There are also responses from the field (ACM Code of Ethics + Montreal Declaration) and local/national government regulation
    • Restrictions on tech like facial recognition and on data collection practices
    • Adherence to best practices doesn’t occur voluntarily. The truth is that a significant subset of the groups looking to use machine learning are not good-faith actors
    • There is little incentive on the end user side to pay those costs [associated with building more ethical models], and therefore little incentive for the vendors of the technology to unilaterally invest in the specialists or tooling required to meet that standard
    • Relying on the market to provide an effective path for implementing ethical considerations has fallen short in several clear incidents:
      • Moral relativism: even when one group refuses to work on a technology on ethical grounds, another group is almost always willing to step into that space to pick up the work instead
  • Machine learning and data science systems are not solely composed of the algorithms needed to process the data, but also the components used to collect that data in the first place
    • Modern data collection incentives for identifying, profiling, and tracking
    • Even when laws prohibit the use of machine learning in certain situations, actors both inside and outside the government either don’t obey those laws, or deliberately obscure their use of these techniques in the first place
  • Today the options to limit harm might be lobbying for regulation, Twitter posts, paywalling code to approve each user, or just stopping development (e.g. Joseph Redmon)

https://thegradient.pub/machine-learning-ethics-and-open-source-licensing-2/ (Part 2)

  • Software is licensed in a fundamentally different way compared to other forms of intellectual property
    • FOSS vs. CC-NC
    • FreedomZero: “The freedom to run the program as you wish, for any purpose” (the OSD phrases this as “no discrimination against field of endeavor”).
  • In open source, you can only have “my” in the associative sense. There is no “my” in open source. - Karl Fogel
    • If each contributor attempted to add their own use restrictions, the project would almost certainly end up bogged down under the copyright restrictions of each individual. By defaulting to open use for users, this can largely be avoided
    • However, many OSS projects rely on a few contributors (e.g. FastAI - Apache 2.0)
  • Standardization on FOSS licenses means that developers who want to share their work are left without the ability to strike a balance between openness and control that creators in other domains have.
    • In machine learning, the practical application cannot be separated from the code that is used to implement it. As a result, the way that software is licensed is fundamental to how we might consider both ownership and responsibility in machine learning. Permissive open source licensing prevents developers from even attempting to achieve downstream moderation of the way their work is used
  • In the same way that the open source movement said “we’re like free software, but with these changes,” I think we’ll end up with a new movement. For the same reasons that “open source” came up with a new name, I think the movement that will arise from today’s developers will also need a new name - Steve Klabnik (Rust)
    • software licenses that apply some level of explicit enforcement on what the copyright owner views as proper use, while maintaining many of the procedural benefits of open code like open development and source code access
  • Subcategories of special software licenses
    • “Cloud protection licenses,” for example, attempt to restrict the ability of competitors (and by “competitors”, they usually mean Amazon) to take their company’s open code and offer it as a hosted service
    • Ethical Source software focuses on reducing harm through various angles, which include respecting human rights, protection of privacy, fair compensation, and more
    • idea of building accountability within the end user base of projects under an Ethical Source license
      • e.g. Hippocratic License seeks to tie alleged human rights abuses where HL-licensed code is used to a binding arbitration agreement, where either a refusal to engage in arbitration or losing the arbitration itself will result in loss of license after a 3-month grace period.
  • ml5.js: custom license tied to a stand-alone, evolving Code of Conduct document that is maintained by a committee which will act as a judgement board in cases of evaluating potential license violations