There are several datasets for AI and ML research that use source code from open-source projects. The models trained on these datasets often don’t comply with those projects’ licenses, which often require attribution of the authors and, in some cases, require that any new project using the code be licensed under the same license as the original.

  • trem@lemmy.blahaj.zone · 5 hours ago

    You can’t generally just add terms to an open-source license. At that point, it is no longer an open-source license, but rather your own custom (a.k.a. proprietary) license.

    As in, there’s a list of license texts that are approved by the Open Source Initiative and you don’t really want to deviate from that. (There’s also a list by the Free Software Foundation for the more freedom-loving among us, which is rather similar and also valid.)

    This also has larger legal implications. There have been lawsuits over open-source licenses, which you can point to when telling a company to fuck off if they commit a similar violation. As soon as you start adding your own terms, there can be contradictions and just generally more surface to attack.

    In particular, most code exists in the form of libraries. If you maintain a library and want users, you do want to stick to the well-known licenses, because no one wants to deal with every library having its own custom terms (considering you can easily end up using hundreds of libraries in an application).
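    The pain of auditing custom terms across hundreds of dependencies is easy to see in practice. Below is a minimal sketch, assuming a Python environment, that enumerates the declared license of every installed distribution using only the standard library’s `importlib.metadata`; the `dependency_licenses` helper name is hypothetical, not an established tool.

    ```python
    from importlib.metadata import distributions

    def dependency_licenses():
        """Map each installed distribution to its declared License metadata field.

        Packages that declare no license (or use classifiers instead) show up
        as "UNKNOWN" — exactly the case where a custom license forces a human
        to go read the terms by hand.
        """
        report = {}
        for dist in distributions():
            name = dist.metadata.get("Name", "unknown")
            lic = dist.metadata.get("License") or "UNKNOWN"
            report[name] = lic
        return report

    if __name__ == "__main__":
        for name, lic in sorted(dependency_licenses().items()):
            print(f"{name}: {lic}")
    ```

    With well-known SPDX identifiers (MIT, Apache-2.0, GPL-3.0-only) this output is machine-checkable; every bespoke license in the list is something a lawyer has to read instead.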

  • Grimy@lemmy.world · 11 hours ago

    Currently, if it can be accessed publicly, it can be used to train a model, regardless of what any license or terms of service says.

  • CombatWombat@feddit.online · 13 hours ago

    Mostly enforcement. Many repositories have put anti-AI licenses on their code, but much like LLM companies’ other violations of copyright law, it’s really difficult to prove they’re pulling the repositories and using them for training. Notoriously, FOSS projects don’t have a lot of money lying around to hire legal teams for fishing expeditions through a lengthy discovery process, in the hope that someone was dumb enough to admit it in writing.

  • 9tr6gyp3@lemmy.world · 15 hours ago

    Which datasets are not complying? Call them out by name. Gotta keep them accountable.

  • treadful@lemmy.zip · edited, 13 hours ago

    Honestly, I don’t think I care if people use my code to train AI models. Assuming they would even respect a license disallowing it, I don’t think withholding my code is going to change the reality of the situation.

    I published my code for people to (hopefully) use. If they want to use it to train their models, I don’t think that really deprives me of anything. And anyone can leverage my code with equal access.

    EDIT: Ironically, my personal forgejo instance is under relentless (presumably AI) crawling, sucking up bandwidth at a truly pointless rate.

  • DomeGuy@lemmy.world · 11 hours ago

    FOSS projects that are covered by a copyleft license can only prohibit the sort of derivative work that they could haul someone into court to stop.

    LLM training advocates argue that their use is technically “educational,” and so they feel justified in ignoring copyright — and since none of these companies have been destroyed by lawsuits, they’re essentially right.

    And that’s ignoring that there’s probably a license grant in the GitHub TOS that can be read as giving permission for activities that reasonably include LLM training.