There are several datasets for AI and ML research that use source code from open-source projects. The models trained on these datasets often don’t comply with the projects’ licenses, which typically require attribution of the authors and, in some cases, require that any new project using the code be licensed under the same license as the original.
You generally can’t just add terms to an open-source license. At that point it’s no longer an open-source license, but rather your own custom (a.k.a. proprietary) license.
As in, there’s a list of license texts approved by the Open Source Initiative, and you don’t really want to deviate from it. (There’s also a list from the Free Software Foundation for the more freedom-loving among us, which is rather similar and equally valid.)
This also has larger legal implications. There have been lawsuits over open-source licenses, which you can point to when telling a company to fuck off if it commits a similar violation. As soon as you start adding your own terms, you risk contradictions and generally create more surface to attack.
In particular, most code exists in the form of libraries. If you maintain a library and want users, you do want to stick to the well-known licenses, because no one wants to deal with every library having its own custom terms (considering you can easily end up using hundreds of libraries in an application).
Currently, if it can be accessed publicly, it can be used to train a model, regardless of what any license or terms of service says.
Mostly enforcement. Many repositories have put anti-AI licenses on their code, but much like LLM companies’ other violations of copyright law, it’s really difficult to prove they’re pulling the repositories and using them for training. Notoriously, FOSS projects don’t have a lot of money lying around to hire legal teams to go on fishing expeditions in a lengthy discovery process, in the hope that someone was dumb enough to admit it in writing.
Which datasets are not complying? Call them out by name. Gotta keep them accountable.
I think basically every dataset used to train modern AI and ML models (from 2021 onward). They usually use code from GitHub and other code-hosting sites.
Yeah but which one? Can you name one?
I’m not too familiar with the specifics, but The Pile maybe?
According to TechCrunch, that dataset is built off of public domain works.
The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.
Honestly, I don’t think I care if people use my code to train AI models. Assuming they would even respect a license disallowing it, I don’t think withholding my code is going to change the reality of the situation.
I published my code for people to (hopefully) use. If they want to use it to train their models, I don’t think that really deprives me of anything. And anyone can leverage my code with equal access.
EDIT: Ironically, my personal forgejo instance is under relentless (presumably AI) crawling, sucking up bandwidth at a truly pointless rate.
FOSS projects that are covered by a copyleft license can only prohibit the sort of derivative work that they could haul someone into court to stop.
LLM training advocates argue that their use is technically “educational” and so feel justified in ignoring copyright, and since none of these companies has been destroyed by lawsuits, they’re essentially right.
And that’s ignoring that there’s probably a license grant in the GitHub TOS that can be read as giving permission for activities that reasonably include LLM training.