On Tue, 29 Apr 2025 at 08:02, Matthias Urlichs <[email protected]> wrote:

> However² IMHO we need to distinguish between things like gnubg or
> tesseract, and today's LLMs or similar "large" models.
>
Yes, that could be a useful thing to do.

> We can, absent no copyright restrictions, more-or-less-easily recreate the
> former's models from their training data.
>
> We can't do that with LLMs or similar-sized models, even if we had source
> code.
>
> Their developers create a model's architecture, presumably some
> Python-or-whatever source code and/or a descriptive language, which *is*
> their source. We don't get that. This source gets compiled to whatever (we
> also don't get these binaries). The result is then run in training mode on
> a large corpus which Debian can't distribute (a) for copyright reasons but
> also (b) because it's too damn large, end up with a base model which they
> don't give us either and which gets tweaked by further training and human
> feedback (partly by poorly-paid gig workers in developing countries), then
> distilled down to manageable size (but still too large for us to distribute
> in many cases).
>
The proposals so far have been to agree to ship the end-model inside
Debian, if all the software used in the training process is DFSG-free and
if the training process is documented (or scripted with included DFSG-free
scripts) and if the training corpus is also either available or shipped
with the source. Technically it should not be a big problem to capture the
training corpus of a model and save it before the training starts - it all
needs to be downloaded and packaged for the training program to consume
anyway. The problems start with the legal (and technical) hurdles in
redistributing this training corpus: such an assembly of direct copies of
copyrightable works would be encumbered by the copyrights of its individual
parts. The source *could* be out there, but we as Debian would not have the
legal rights to redistribute it, even if any individual developer had the
right to acquire such a data set and use it to train the model (given
sufficient resources at their disposal).
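
To make the three conditions proposed so far concrete, here is a minimal
sketch in Python. The class, field and function names are made up for
illustration only; they do not correspond to any existing Debian tooling
or agreed policy.

```python
from dataclasses import dataclass


@dataclass
class ModelCandidate:
    # All field names are hypothetical, chosen only for this sketch.
    training_software_dfsg_free: bool  # every tool used in training is DFSG-free
    training_process_documented: bool  # documented, or scripted with DFSG-free scripts
    corpus_available: bool             # training corpus can be obtained elsewhere
    corpus_shipped_with_source: bool   # training corpus is shipped with the source


def meets_proposed_criteria(m: ModelCandidate) -> bool:
    """Return True if all three conditions proposed so far hold."""
    return (
        m.training_software_dfsg_free
        and m.training_process_documented
        and (m.corpus_available or m.corpus_shipped_with_source)
    )
```

Note that this only encodes the proposal as stated; it says nothing about
whether Debian has the legal right to redistribute the corpus.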

This basically raises a fundamental question of whether the training data
actually is source code, or whether it needs a different legal and technical
definition with different rules for handling it. Copyright law currently
seems to be interpreted so that training data is *not* the same as
source code. IMHO we should not treat it as source code either.

The question of how to handle additional training that refines the model
using human feedback has not been considered at all so far. This could be
made DFSG-free from both the licensing and the testing-protocol
perspective. But practical reproducibility would then face barriers similar
to those for the model itself, which needs billions of dollars worth of
hardware and millions of dollars worth of electricity per reproduction.

> So our choice is basically between shipping something we don't control and
> can't introspect, and, well, not doing so.
>
> There is no third choice of distributing a free alternative, because even
> if we get the architecture's source code and aside from the copyright issue
> and the humongous-size issue and the multiple-manual-build-steps issue and
> the shouldn't-we-save-energy-dammit issue there's the looming problem that
> almost(?) none of us have even remotely enough GPUs to reproduce the
> resulting model in the first place.
>
> My vote is on not doing so. We might want to ship the requisite tools in
> contrib and let people download the models from huggingface, but that's as
> far as I want to take Debian in that direction.
>
That might be the most likely outcome. But in that case IMHO the community
would benefit from guidance that differentiates between what we see as
free AI (and what not), and which subset of those free AI models we
consider practically includable in the Debian archive. That subset would be
subject to additional criteria for input data packaging and availability,
as well as for the amount of resources needed to re-create the model (with
a limit set by what Debian actually has in terms of hardware and can afford
to spend on power/rental costs if needed).

Debian has historically been a very important community voice in defining
clear criteria and targets that the rest of the community then rallied
around: starting with the DFSG and the analysis of specific copyright
licenses, and continuing to more recent projects like reproducible builds.

-- 
Best regards,
    Aigars Mahinovs
