Does the AGPL apply when distilling data into your AI model?
A very interesting question popped up in the Podlodka chat.[1] I have rephrased it, staying close to the original:
If we train our model on the output of another model (i.e., distill it) that is available under AGPL-3.0, must our model also be licensed under the AGPL, or can it be released under another license (for example, BSD)?
That's a cool question. And it immediately raises a number of related questions, which also need to be answered before the first one can be:
- What are the limits of the AGPL's application: software only or not? Is it applicable to AI models and their outputs?
- What is an AI model as intellectual property: software, database, other?
- What is the output of the model: part of the software, database, or other?
- What is distillation (in the context of model learning)? Is there any action within this operation to copy data (or software code) from one model to another? Is there data extraction from a database[2] here?
Interestingly, a similar issue has already been analyzed (regarding the YOLOv8 code and weights), where the question was whether trained models should be considered part of the software or output data. As we can see, the participants did not reach a unanimous verdict. This is understandable: the Ultralytics representative protects the interests of the company. He avoids weakening its position in selling commercial licenses and keeps his answers open enough to leave room for maneuver in potential disputes, so that they cannot be turned against the company later if they prove to grant users more legal freedom than its other actions do.
To find the answer to our question, let's first work through the related ones.
The legal context around the AGPL and AI models
To answer, we need to take into account the existing legal context. I think it's worth first outlining the context that guided me, so that the further course of my reasoning is easier to follow:
- Surrounding circumstances. A legal comment on such an issue largely depends on (a) which jurisdiction we are talking about (since laws and their interpretation may vary greatly between countries), (b) which side of the conflict the lawyer represents (i.e., from whose position the answer is built: the party taking the data, or the party deciding whether that taking should be stopped or accepted as justified), (c) what exactly is technically happening in reality.
- What do I mean by distillation? What is described in the Ultralytics glossary, the well-known article by Hinton et al. (2015), and in this blog post and this tutorial. I find it difficult to master the technical details of the above, but in the legal context the following seemed important: (1) in this process, data is transferred from one model to another; (2) this data is saved for the student model (so that the teacher model does not need to be contacted later).
- What do I mean by models? There is no clear general approach yet as to what kind of objects they are under copyright law, and this again depends on the jurisdiction, see par. (1). My comments below apply only to Russia (as a jurisdiction), so for simplicity we assume that the model is a computer program (art. 1261 of the Civil Code of the Russian Federation, the “RCC”) + database(s) (art. 1260 of the RCC).[3],[4]
- Difficulties with the AGPL. Although the AGPL is a copyleft license, as far as I know it often stumbles in legal disputes.[5] Its text is rather confusing, and so is the court practice on it.[6] And again, the question of its application depends on par. (1) above.
- Open source and authorship. In Russia, it is sometimes difficult for software developers to prove their authorship of their software, in part because of issues with using third-party open source code in their own programs.[7]
- Software patents and algorithms. I leave questions about patents out of the study (because this is an additional large legal layer, and it also depends on the jurisdiction). But let me remind you that the AGPL has section 11 on patents. Perhaps it already has answers to some of your questions.
- Risks. Since there is no clear answer in the law or in established court practice, we are dealing with the risks of legal uncertainty. These include both possible violations of third parties' intellectual rights (when using third-party data for your model) and possible difficulties in proving violations of your own rights (if someone uses your model, and you have trouble proving that you really hold rights to the model and its elements, including third-party ones: open source, distilled data, UGC, etc.).
Thus, the reasoning below is a search for an answer to the question posed, taking this context into account. With that as a starting point, we should keep the following aspects in mind.
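To make the two legally relevant points about distillation more tangible (data passes from the teacher to the student, and it is stored so the teacher is not queried again), here is a minimal sketch in plain Python. It is a simplification in the spirit of Hinton et al. (2015), not the procedure of any particular library: the sample names, logits, and temperature are made-up assumptions, and a real pipeline would update the student's parameters with an optimizer rather than just compute a loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature softens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_probs, student_logits, temperature=2.0):
    """KL divergence between stored teacher soft targets and the student's prediction."""
    student_probs = softmax(student_logits, temperature)
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )

# Step 1: query the teacher once and store its outputs (the "transferred" data).
# These logits are hypothetical stand-ins for real teacher responses.
teacher_logits_by_input = {
    "sample-1": [4.0, 1.0, 0.5],
    "sample-2": [0.2, 3.5, 0.1],
}
stored_soft_targets = {
    name: softmax(logits, temperature=2.0)
    for name, logits in teacher_logits_by_input.items()
}

# Step 2: the student is scored against the stored targets only;
# the teacher model is not contacted again.
student_logits = [1.0, 1.0, 1.0]  # an untrained student guessing uniformly
loss = distillation_loss(stored_soft_targets["sample-1"], student_logits)
print(f"distillation loss for sample-1: {loss:.4f}")
```

Note that `stored_soft_targets` is exactly the kind of saved collection of teacher outputs that the database-extraction discussion in this text is concerned with; the teacher's code and weights never appear in it.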
Intermediate knowledge (coming to the answer)
- (a) The concept of software, database. In Russia, software is "... a set of data and commands <...> in order to obtain a certain result, including preparatory materials obtained during development ... and the audiovisual displays generated by it" (art. 1261 of the RCC).[8] A database is "... a collection of independent materials <...>, systematized in such a way that these materials can be found and processed using <...> computers" (art. 1260 of the RCC). In other words, data is part of the software, while materials (also data)[9] that can be stored outside the software are part of the database.
- (b) Databases included in a model. I include databases in a model (as an object of rights) because, for example, weights (as a set of parameters) and question-answer sets (pairs of user queries and model responses, including the "correct" ones, in text, vector or other form) can be recognized as databases (in the legal sense).
- (c) Audiovisual displays. What is meant by this? The RCC (as is often the case) does not define the concept. At the same time, this is not the same as an audiovisual work (art. 1263 of the RCC), since it is called differently. It is clear that by default we mean the graphical interface (GUI) of the program itself[10] (since there is software without it), but it cannot be ruled out that multimedia output data (images, videos) can also be attributed to this concept.
- (d) "Infection" of the output data, extraction from a database. I agree with Gemini's answer, but (i) it applies to the (A)GPL itself (see the FAQ about this), and copyright holders can block this with their additional terms (ToS, EULA, etc.);[11] (ii) the same FAQ contains additional clarifications on when such "infection" with the GPL license is possible (they will also apply to the AGPL). And yes, this is consistent with the legal rule that no one has the right to extract materials (=data) from a database and use them without the permission of the copyright holder, with the exceptions provided for in the law. To extract = to transfer all of its contents or an essential part of them to another medium by any technical means and in any form (art. 1334 of the RCC).
- (e) Extracted data (from the first database) as part of the second database is a use of the first database. Due to legal uncertainty, and based on the interpretation of the concept of "extracting data from a database", it is possible that in a court dispute the copyright holder of the teacher model (which includes the first database, from which data was taken into the second one during distillation) may argue that a significant part of the data was illegally extracted from their database,[12] and that this data was stored in the student model (which includes the second database).
- (f) Blurring of the materiality criterion. The laws do not clarify[13] what is considered an essential part of a database in relation to extraction (i.e., there is no rule setting, say, 5-10% of the total volume of the database as an acceptable amount to extract / copy without the consent of its copyright holder). The court will determine this case by case, based on the circumstances of a particular dispute.[14]
- (g) The limits of the AGPL. The AGPL explicitly states in its preamble that it applies not only to software code. Therefore, we can assume that the AGPL will apply to everything in a repository whose content is explicitly marked (for example, in the README file) as licensed under the AGPL. But again, we come up against the question (see above) of whether the output data (which is not in the repository, but is produced by running the contents of the AGPL repository) is subject to the AGPL.
All this leads us to the following.
We formulate the answer (and its assumptions)
Note that the answer rests on a number of assumptions. For example, I proceed from the fact that (1) the legal qualification of the model is software + database,[15] (2) the AGPL applies to all files in the AGPL repository, (3) the output is not an audiovisual display (as understood by art. 1261 of the RCC), (4) the exceptions described in the GPL FAQ do not apply to our issue, (5) if an extraction from a database (which is part of the model) occurs, then it is not possible (or is difficult) to determine the amount of extracted data (relative to the total amount of data in it).[16]
Taking into account the context and assumptions above, I arrive at the following.
Possible answer
- If only the output data of an AGPL model is used, and its code, weights (parameters), or other elements contained as is in the AGPL repository are not used as part of your own model, then there is no obligation to license your model under the AGPL too.[17]
- At the same time, it is worth making sure that the copyright holder of the AGPL model has not set accompanying conditions, restrictions or rules on the use of its output data (for example, in the form of an EULA, ToU, ToS or even rules in the README).[18]
- For reliability, you can also get direct explanations (that you do not need to apply the AGPL) from the copyright holder of the model whose output data will be used.[19]
- However, the question remains whether collecting the output data of the AGPL model (during distillation) can also be considered extraction from databases (which, together with the AGPL code, are part of the model and are in the repository). If we assume so, then under Russian law we may be dealing with the use of a third party's database (that of the copyright holder of the distilled model) within the meaning of art. 1334 of the RCC. We then face a two-way fork: (1) if the output contains fragments of files (code, model weights, or other elements) in the same form as they are in the AGPL repository (as is), then we can say that the AGPL applies to them (and to the rest of the IP), see par. 1 above; (2) if the output does not contain such fragments, then there is a high probability that the situation will be qualified by analogy with the example mentioned (where the terms of the GPL do not automatically apply to the book), subject to par. 2 of the answer and assumption (4) above.[20]
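Purely as an illustration, the logic of the possible answer can be written down as a small decision function. This is a sketch of the reasoning above under the same assumptions, not a legal test: the field names and result strings are hypothetical, and a real assessment depends on the jurisdiction and the facts of the case.

```python
from dataclasses import dataclass

@dataclass
class DistillationFacts:
    # All fields are hypothetical labels for the circumstances discussed above.
    uses_agpl_code_or_weights: bool       # code, weights or other repo elements reused as is
    output_contains_repo_fragments: bool  # teacher output reproduces repo files as is
    gpl_faq_exception_applies: bool       # an exception from the GPL FAQ covers the case
    extra_terms_restrict_output: bool     # EULA / ToU / ToS / README rules on output use

def agpl_outlook(facts: DistillationFacts) -> str:
    """Map the circumstances to the outcome suggested by the possible answer."""
    if facts.uses_agpl_code_or_weights or facts.output_contains_repo_fragments:
        return "AGPL likely applies"
    if facts.gpl_faq_exception_applies:
        return "AGPL may extend to the output"
    if facts.extra_terms_restrict_output:
        return "AGPL likely does not apply, but additional terms restrict the output"
    return "AGPL likely does not apply"

# Example: distillation that uses only the teacher's output, with no extra terms.
output_only = DistillationFacts(
    uses_agpl_code_or_weights=False,
    output_contains_repo_fragments=False,
    gpl_faq_exception_applies=False,
    extra_terms_restrict_output=False,
)
print(agpl_outlook(output_only))
```

The order of the checks mirrors the order of the points in the answer: reuse of repository contents comes first, then the GPL FAQ exceptions, then any additional terms from the copyright holder.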
Afterword
Naturally, my comments above and the possible answer are just legal reflections out loud, not legal advice. It is noteworthy that even Gemini, in its response, hedges its bets on behalf of its own copyright holder: "... your new model will most likely not be considered a derivative work ...", "It is highly recommended to consult a lawyer ...", "With a high degree of probability, you can use ...". Well, Google's developers follow the same path as the guys from Ultralytics: minimize the company's risks, add disclaimers, and wait for requests for a commercial license.
And yes, there are other wonderful questions right there. For example:
- Are weights considered objects protected by copyright if they are just mathematical formulas?
- How can copyrights be taken into account for the result of combining different models (such as WhisperSpeech)? What kind of object will it be, and how should the copyrights and restrictions of the copyright holders of the underlying models be taken into account?
- Since computer programs include data, should data within models be singled out as databases (as separate objects of rights), or should all elements of the model be recognized only as software?
- What exactly is meant by audiovisual displays? Should images and videos generated by a neural network be classified as such? If so, then why split the legal qualification between text data (as not falling under the concept of audiovisual) and images and videos as output data of models?
- How quickly will we get to litigation over the reverse engineering of models, and to clean-room development serving as a successful defensive position in such disputes?
But analyzing them is a subject for other materials.
Aftertaste
Finding a solution to this issue is a legal quest (difficult but fascinating). Naturally, the task does not feel completed; rather, it is like climbing a mountain only to see a view of other (higher) mountains of the range. And the feeling is similar: not hardcore, but close.
I caught myself thinking that presenting the material is like generating an AI response: well, my neural network also thought about it and gave a result. But this was not originally planned. It just quickly became clear that the analysis of the issue would go beyond the format of a convenient answer in a Telegram chat, and that a description of the context and assumptions would show more clearly why and how I arrived at this answer.
[1] Podlodka (“Submarine” in English) is a popular Russian-language podcast and the community around it (with its chat and channel in Russian).
[2] This whole text is about applying Russian law to the issue (see the details below). So here I mean the term as defined in art. 1334 of the RCC. For your convenience, you can find an English translation of this law, e.g. here. Note, however, that this translation renders the term as “retrieval of materials”.
[3] I am not taking into account complex objects, composite works, multimedia products, and other works (which also feature in the RCC). Assessing whether they fit the legal qualification of models is a topic for separate articles. But, of course, the construct of a composite product suggests itself for treating a model as intellectual property. And we have long been waiting for the law to change what counts as a complex object (otherwise it turns out that some "other audiovisual work" is one, while digital products that are much more complex in production and content, such as AI models, online services and software, are not).
[4] As we all understand, there may be open source repositories where only data is available, without program code. However, a set of data published as open source (for example, weights or other parameters) may also be part of the software, see below for the definition of the term “software” (as understood by the RCC).
[5] For example, recall MongoDB's license transition: they did it for certain reasons.
[6] One of the most recent significant cases related to the AGPL is Neo4j v. PureThink (USA).
[7] For example, the court case A. Mamichev v. Veeam Software (his former employer).
[8] Note that my own English translation of the RCC articles (in this text) may differ from the English translation I have linked for your convenience, see my note [2] above.
[9] It is interesting that the law does not use the concept of data in the definition of the term "database" (it uses "materials" instead), but this is not critical in the context of the current discussion. Moreover, this gap can be filled by court practice.
[10] By the way, this is probably one of the reasons why Russian courts persistently classify video games as computer programs rather than multimedia products (art. 1240 of the RCC): after all, it can be argued that the game interface is just an audiovisual display.
[11] I mean not rewriting the terms of the AGPL itself (in that case it would no longer be the AGPL, but another AGPL-based derivative license, for example MongoDB's SSPL), but adding terms in a separate document (for example, terms of data use). Perhaps in this case the copyright holder would itself violate the terms of the AGPL, but this may still become another barrier that the developer of the student model must overcome (in order to prove the legality of using data from the teacher model).
[12] For example, if the developer of the student model, which has been infused with the data obtained during distillation, fails to comply with any condition of the AGPL.
[13] Since the comments in the response relate only to Russian jurisdiction (see par. (1) above), we are talking only about Russian legislation (the RCC (Part 4) and other laws).
[14] However, the copyright holder of the database (whose data is used to train their own teacher model) may also have to spend a lot of time and effort proving that their rights have been violated. The litigation between Vkontakte and Double Data confirms this.
[15] Based on my note [4] above.
[16] A fair question would be: if there is a data file in the repository, and its volume is known (in GB), is it possible to calculate the amount of output data received in relation to such a file? There is logic in this, but there is a problem: this data may itself be derived from another database.
[17] Again, if the case under discussion does not fall under the exceptions described in the GPL FAQ (see above).
[18] Based on my note [11] above.
[19] However, as we can see, copyright holders are not always ready to give the answer that users expect, see the YOLOv8 example above.
[20] Of course, the issue becomes more interesting if the output collected during distillation (not in as is form) is recognized as a derivative work (in relation to Russia, art. 1260 of the RCC). Such recognition would strengthen the position of the copyright holders of teacher models. Within this material I am not considering this approach either, so as not to complicate the issue under analysis.
Materials that can help advance this issue and related topics:
I did not read this while preparing the answer, but I had it in my bookmarks and am glad to share it
An impressive analysis of the AGPL license and its ambiguities
P.S. I thank my friends (developers and analysts) who gave me the explanations I needed for the questions that arose during this quest, and Valentina D. for help with editing.