Answers to frequently asked questions
What is an open-source AI?
An open-source AI is a system made freely available with all code, data and parameters necessary to grant the freedom to use, study, modify, and share it.
Why did you write the open-source AI definition?
The AI Act exempts "free and open-source AI systems" from "some of the most onerous requirements of technical documentation and the attendant scientific and legal scrutiny" designed to protect people and democracies from the risks posed by AI black boxes.
Its publication led the OSI to run a convoluted design ("co-design") process that produced a flawed definition. That definition, if taken seriously in a court of law, could open a loophole in the European legislation, allowing black boxes trained on "unshareable" data to pass as Open Source while providing no transparency or accountability.
Even the current OSI President, Stefano Maffulli, confirmed at Open Source Summit Europe that "data is essential for understanding and studying the system".
Bruce Perens, OSI co-founder and creator of the Open Source Definition, argued that there is no need for an "Open Source AI definition" because "you can apply the original Open Source Definition to machine learning and I feel it works better. You need to apply it to two pieces: the software (and possibly specialized hardware), and the training data. The training data is “source code” for the purposes of the Open Source definition."
Inspired by that insight but concerned about the damage that a flawed definition would cause, we wrote the open-source AI definition to preserve people's safety, security and trust in the open-source movement.
Who are you?
Just researchers, developers, and organizations who signed the open-source AI definition.
What's the difference between the Open Source Definition and the open-source AI definition?
The Open Source Definition (OSD) applies to individual software programs.
The open-source AI definition applies to more complex systems composed of several parts that can be encoded in different forms, such as code, data, weights and so on.
What's the difference between the "Open Source AI Definition" and the open-source AI definition?
"Open Source AI Definition" and "OSAID" refer to the output of the convoluted design ("co-design") process run by OSI. By allowing the use of "unshareable data" that inhibit any meaningful study of the system, it turns black boxes into "Open Source AI".
The open-source AI definition has been written by the open-source community.
The hyphen marks the difference: the open-source AI definition matches the transparency assumed by the AI Act and by similar regulations worldwide.
What is the role of training data in the open-source AI definition?
Open-source means giving anyone the ability to meaningfully study and modify your system, without requiring additional permissions, for any purpose. This is why OSD #2 requires that the source code must be provided in the preferred form for making modifications. This way everyone has the same rights and abilities as the original developers, starting a virtuous cycle of innovation.
Source Data are essential for understanding and studying many AI systems, because they are to parameters what source code is to a binary executable.
Training and cross-validation data are part of the Source Data, which also include any random values used during processing to compute the parameters.
Why do you forbid the exclusion of any data?
For several reasons.
First, for many AI systems (e.g. those based on artificial neural networks) no meaningful freedom to study is possible without the data used to compute the parameters of the system. For example, it is not possible to verify the generalization abilities of an LLM without knowing whether the response to a specific prompt was part of the training corpora.
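As a toy illustration of why access to the training data matters for this kind of check, here is a naive membership test. This is only a sketch: it assumes the corpus is available as plain-text `.txt` files (a hypothetical layout), and real contamination checks use more robust near-duplicate matching, but even this simplest test is impossible without the data.

```python
from pathlib import Path

def in_training_corpus(text: str, corpus_dir: str) -> bool:
    """Return True if `text` appears verbatim in any corpus file.

    A naive substring scan over whitespace-normalized text; real
    contamination checks also look for near-duplicates (e.g. n-gram
    overlap), but all of them require access to the training data.
    """
    needle = " ".join(text.split())  # normalize whitespace
    for path in Path(corpus_dir).glob("**/*.txt"):
        haystack = " ".join(path.read_text(encoding="utf-8").split())
        if needle in haystack:
            return True
    return False
```

Without the Source Data, no user of the system can run even a check as simple as this one.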
Second, many generative AI systems encode and output long fragments of the training data. This might expose the users to subtle violations of the rights of third parties.
For example, a user might accidentally publish personal data used to train the system without the consent of the data subject, or distribute plagiaristic outputs in violation of copyright law.
We want open-source AI systems to be safe and trustworthy for the users and the communities that adopt them, so we require full transparency and we mitigate the legal risks by allowing only findable, accessible, interoperable, and reusable data that are properly licensed and legally collected.
How can a LLM qualify as open-source AI?
Obviously, it depends on how it was built in the first place, but a simple LLM trained on data contained in Common Crawl could qualify as open-source AI by sharing:
- Source Data: the findable, accessible, interoperable, and reusable reference to the Common Crawl data actually used in training and cross-validation.
- Source Code: the source code of every program used to retrieve the Source Data, train the LLM and run it, including the scripts used to automate the whole process, under licenses compliant with the Open Source Definition.
- Processing Information: the documentation about how the Source Data were collected and used to create the LLM, so that a skilled person can recreate a copy of the LLM from the Source Data and Source Code, for example under CC BY 4.0.
- Parameters: the model parameters, such as weights or other configuration settings, under terms that grant the four freedoms.
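To make the first item concrete, the Source Data reference could be published as a machine-readable manifest pinning the exact files used. The sketch below is purely illustrative: the snapshot identifier, file paths, field names and digests are hypothetical placeholders, not taken from any real system.

```python
import json

# Hypothetical Source Data manifest: a findable, accessible, interoperable
# and reusable reference to the exact Common Crawl files used in training
# and cross-validation. All values below are illustrative placeholders.
manifest = {
    "dataset": "example-llm-source-data",
    "source": "Common Crawl",
    "snapshot": "CC-MAIN-2023-50",
    "training_files": [
        {"path": "crawl-data/CC-MAIN-2023-50/segments/0001/warc/part-000.warc.gz",
         "sha256": "<digest pinning the exact bytes>"},
    ],
    "cross_validation_files": [
        {"path": "crawl-data/CC-MAIN-2023-50/segments/0001/warc/part-001.warc.gz",
         "sha256": "<digest pinning the exact bytes>"},
    ],
}

# Serialize the reference so it can be published alongside the
# Source Code, the Processing Information and the Parameters.
with open("source-data-manifest.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)
```

Because the referenced data are hosted publicly, sharing such a manifest (plus checksums) is enough to let anyone retrieve the exact bytes used for training.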
What about federated learning or similar techniques?
Federated learning (also known as collaborative learning) is a sub-field of machine learning focusing on settings in which multiple entities collaboratively train a model while ensuring that their data remains decentralized and preserving data privacy, data minimization, and data access rights.
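The basic mechanism can be sketched as federated averaging: each participant computes a parameter update on its own private data, and only the parameters travel over the network. The following toy sketch (hypothetical names, a one-parameter least-squares model) illustrates that data flow; it is not a real federated-learning implementation.

```python
def local_update(weights, local_data, lr=0.1):
    """One gradient step of 1-D least squares (y ~ w*x) on private data.

    Only this function ever sees the raw samples; it returns the
    updated parameters, which are the only thing sent to the server.
    """
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
    return [w - lr * grad]

def federated_average(weight_lists):
    """Server step: average the parameters received from the clients."""
    n = len(weight_lists)
    return [sum(ws[i] for ws in weight_lists) / n
            for i in range(len(weight_lists[0]))]

# Two clients hold private datasets that never leave their machines.
client_data = [
    [(1.0, 2.0), (2.0, 4.0)],   # client A: consistent with w = 2
    [(1.0, 2.2), (3.0, 6.1)],   # client B: noisy samples around w = 2
]

weights = [0.0]
for _ in range(50):  # communication rounds
    updates = [local_update(weights, data) for data in client_data]
    weights = federated_average(updates)  # only parameters are shared
```

Note that even though the raw samples never leave the clients, the freedoms to study and modify the resulting system still depend on those samples being shareable, which is exactly the condition stated next.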
A system built through federated learning can qualify as open-source AI if and only if the data actually used in training can also be shared. Concretely, this means that:
- the data must be properly anonymized before being used for training, and they must be shared across the network along with the computed parameters;
- when legally required, the distribution of such data under terms that grant the four freedoms must be allowed by the people to whom the data refer, through explicit consent;
- when legally required, the distribution of such data under terms that grant the four freedoms must be permitted by the copyright holders.
This way, both the freedom to study and to modify the system will be granted to any user of the system without compromising privacy and data protection rights.
Whenever such distribution is not possible, and more generally when an AI system cannot fully grant the four freedoms, it cannot qualify as an open-source AI.
Why don't you define AI or machine learning?
For the same reason the Open Source Definition did not define software.
Won't this relegate open-source AI to a niche?
No more and no less than the Open Source Definition relegated open-source software to a niche when it was written in 1998.
Is the open-source AI definition covering models and weights and parameters?
Yes.