[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread rgoswami

From a quick look, it seems like some of these (the masked-array ones) are 
trivial enough not to warrant inclusion, and the ctypes snippet is obvious 
enough that copyright claims won't be an issue. On broader policy I don't 
have much to say, except that, in general, it is probably hopeless to 
enforce a ban on AI-generated content.


--- Rohit

On 7/4/24 4:59 AM, Loïc Estève via NumPy-Discussion 
 wrote:

Hi,

in scikit-learn (more of an FYI than a policy: among other things it does
not even mention "AI" explicitly, and it avoids the licence discussion),
we recently added a note in our FAQ about "fully automated tools":
https://github.com/scikit-learn/scikit-learn/pull/29287

From my personal experience in scikit-learn, I am very skeptical about
the quality of this kind of contribution so far ... but, you know, the
future may well prove me very wrong.

Cheers,
Loïc

> Hi,
>
> We recently got a set of well-labeled PRs containing (reviewed)
> AI-generated code:
>
> https://github.com/numpy/numpy/pull/26827
> https://github.com/numpy/numpy/pull/26828
> https://github.com/numpy/numpy/pull/26829
> https://github.com/numpy/numpy/pull/26830
> https://github.com/numpy/numpy/pull/26831
>
> Do we have a policy on AI-generated code?   It seems to me that
> AI-code in general must be a license risk, as the AI may well generate
> code that was derived from, for example, code with a GPL-license.
>
> Cheers,
>
> Matthew
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: loic.est...@ymail.com


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Daniele Nicolodi

On 03/07/24 23:40, Matthew Brett wrote:

Hi,

We recently got a set of well-labeled PRs containing (reviewed)
AI-generated code:

https://github.com/numpy/numpy/pull/26827
https://github.com/numpy/numpy/pull/26828
https://github.com/numpy/numpy/pull/26829
https://github.com/numpy/numpy/pull/26830
https://github.com/numpy/numpy/pull/26831

Do we have a policy on AI-generated code?   It seems to me that
AI-code in general must be a license risk, as the AI may well generate
code that was derived from, for example, code with a GPL-license.


There is definitely the issue of copyright to keep in mind, but I see 
two other issues: the quality of the contributions and one moral issue.


IMHO the PRs linked above are not high-quality contributions: for 
example, the added examples are often redundant with each other. In my 
experience this is representative of automatically generated content: 
as there is little to no effort involved in writing it, the content is 
often repetitive, with very low information density. In the case of 
documentation, I find this very detrimental to the overall quality.


Contributions generated with AI have huge ecological and social costs. 
Encouraging AI generated contributions, especially where there is 
absolutely no need to involve AI to get to the solution, as in the 
examples above, makes the project co-responsible for these costs.


Cheers,
Dan



[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Marten van Kerkwijk
Hi All,

I agree with Dan that the actual contributions to the documentation are
of little value: it is not easy to write good documentation, with
examples that show not just the mechanics but the purpose of the
function, i.e., that go well beyond showing some random inputs and
outputs.  And poorly constructed examples are detrimental in that they
just hide the fact that the documentation is bad.

I also second his worries about ecological and social costs.

But let me add a third issue: the costs to maintainers.  I had a quick
glance at some of those PRs when they were first posted, but basically
decided they were not worth my time to review.  For a human contributor,
I might well have decided differently, since helping someone to improve
their contribution often leads to higher quality further contributions.
But here there seems to be no such hope.

All the best,

Marten




[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Matthew Brett
Sorry - reposting from my subscribed address:

Hi,

Sorry to top-post!  But - I wanted to bring the discussion back to
licensing.  I have great sympathy for the ecological and code-quality
concerns, but licensing is a separate question, and, it seems to me,
an urgent question.

Imagine I asked some AI to give me code to replicate a particular algorithm A.

It is perfectly possible that the AI will largely or completely
reproduce some existing GPL code for A, from its training data.  There
is no way that I could know that the AI has done that without some
substantial research.  Surely, this is a license violation of the GPL
code?   Let's say we accept that code.  Others pick up the code and
modify it for other algorithms.  The code-base gets infected with GPL
code, in a way that will make it very difficult to disentangle.

Have we consulted a copyright lawyer on this?   Specifically, have we
consulted someone who advocates the GPL?

Cheers,

Matthew



[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Ralf Gommers
On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett 
wrote:

> Sorry - reposting from my subscribed address:
>
> Hi,
>
> Sorry to top-post!  But - I wanted to bring the discussion back to
> licensing.  I have great sympathy for the ecological and code-quality
> concerns, but licensing is a separate question, and, it seems to me,
> an urgent question.
>
> Imagine I asked some AI to give me code to replicate a particular
> algorithm A.
>
> It is perfectly possible that the AI will largely or completely
> reproduce some existing GPL code for A, from its training data.  There
> is no way that I could know that the AI has done that without some
> substantial research.  Surely, this is a license violation of the GPL
> code?   Let's say we accept that code.  Others pick up the code and
> modify it for other algorithms.  The code-base gets infected with GPL
> code, in a way that will make it very difficult to disentangle.
>

This is a question that's topical for all of open source, and for uses of
Copilot & co. We're not going to come to any insightful answer here that is
specific to NumPy. There's a ton of discussion in a lot of places; someone
needs to research and summarize it to move this forward. Debating it from
scratch here is unlikely to yield new arguments, imho.

I agree with Rohit's point that "it is probably hopeless to enforce a ban
on AI generated content". There are good ways to use AI code-assistant
tools and bad ones; in general we cannot know whether AI tools were used at
all by a contributor (just as we can't know whether something was copied
from Stack Overflow), nor, when they were, whether the content is derived
enough to fall under some other license. The best we can do here is add a
warning to the contributing docs and PR template, saying that the
contributor needs to be the author, so copied or AI-generated content must
not contain anything complex enough to be copyrightable (none of the linked
PRs come close to this threshold).
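For concreteness, such a PR-template note might look something like the
sketch below (the wording is invented here for illustration; it is not
actual NumPy project text):

```markdown
<!-- Hypothetical addition to the pull-request template; wording is a
     sketch only, not the project's adopted text. -->
- [ ] I am the author of this contribution. Any content copied from
      elsewhere or produced with AI assistance is simple enough not to
      be copyrightable, and I take responsibility for its licensing.
```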


> Have we consulted a copyright lawyer on this?   Specifically, have we
> consulted someone who advocates the GPL?
>

Not that I know of.

Cheers,
Ralf




[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Matthew Brett
Hi,

On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers  wrote:
>
>
>
>
> This is a question that's topical for all of open source, and usages of 
> CoPilot & co. We're not going to come to any insightful answer here that is 
> specific to NumPy. There's a ton of discussion in a lot of places; someone 
> needs to research/summarize that to move this forward. Debating it from 
> scratch here is unlikely to yield new arguments imho.

Right - I wasn't expecting a detailed discussion on the merits - only
some thoughts on policy for now.

> I agree with Rohit's: "it is probably hopeless to enforce a ban on AI 
> generated content". There are good ways to use AI code assistant tools and 
> bad ones; we in general cannot know whether AI tools were used at all by a 
> contributor (just like we can't know whether something was copied from Stack 
> Overflow), nor whether when it's done the content is derived enough to fall 
> under some other license. The best we can do here is add a warning to the 
> contributing docs and PR template about this, saying the contributor needs to 
> be the author so copied or AI-generated content needs to not contain things 
> that are complex enough to be copyrightable (none of the linked PRs come 
> close to this threshold).

Yes, these PRs are not the concern - but I believe we do need to plan
now for the future.

I agree it is hard to enforce, but it seems to me it would be a
reasonable defensive move to say, for now, that authors will need to
take full responsibility for copyright and that, as of now,
AI-generated code cannot meet that standard, so we require authors to
turn off AI generation when writing code for NumPy.

Cheers,

Matthew


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Bill Ross
> It is perfectly possible that the AI will largely or completely reproduce 
> some existing GPL code for A, from its training data.  There is no way that I 
> could know that the AI has done that without some substantial research. 

Even if it did, what if the common code were arrived at independently,
e.g. because it wasn't in the training data?

Searching by approximate text match seems to cover similarity, though it
may require a legal standard for this purpose. Aside from ML, the method
I'm familiar with uses cosine similarity on an N-dimensional vector of
counts of, say, all 5-character sequences in the text, where N is ~26^5.
Any licensed code would be fingerprinted and checked for license status
before being added to an official database.
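A minimal sketch of that fingerprinting scheme (5-character-sequence count
vectors compared by cosine similarity; the function names and toy snippets
below are invented for illustration):

```python
from collections import Counter
from math import sqrt

def ngram_counts(text, n=5):
    """Sparse count vector of all n-character sequences in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * \
           sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A fingerprint database would store one count vector per licensed file;
# new contributions would be compared against every stored vector.
licensed  = "while b: a, b = b, a % b  # Euclid's algorithm"
candidate = "while d: c, d = d, c % d  # Euclid's algorithm"
score = cosine_similarity(ngram_counts(licensed), ngram_counts(candidate))
# Identical texts score ~1.0; texts sharing no 5-char sequence score 0.0;
# renamed-variable copies like the pair above land somewhere in between.
```

In practice, similarity detectors tend to store winnowed hashes of such
n-grams rather than dense ~26^5 vectors, and what similarity threshold
matters legally is, as noted above, an open question.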

Bill
--

Phobrain.com 


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Ralf Gommers
On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett 
wrote:

> Hi,
>
> Yes, these PRs are not the concern - but I believe we do need to plan
> now for the future.
>
> I agree it is hard to enforce, but it seems to me it would be a
> reasonable defensive move to say - for now - that authors will need to
> take full responsibility for copyright, and that, as of now,
> AI-generated code cannot meet that standard, so we require authors to
> turn off AI-generation when writing code for Numpy.
>

I don't think that is any more reasonable than asking contributors not to
look at Stack Overflow at all, or not to look at any other code base for
any reason. I bet many contributors may not even know whether the
auto-complete functionality in their IDE comes from a regular language
server (see https://langserver.org/) or an AI-enhanced one.

I think the two options are:
(A) do nothing yet, and wait until the tools mature to the point where they
can actually do what you're worrying about here (at which point there may
be more insight/experience in the open source community about how to deal
with the problem);
(B) add a note along the lines I suggested as an option above ("... not
contain things that are complex enough to be copyrightable ...").

Cheers,
Ralf
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Daniele Nicolodi

On 04/07/24 12:49, Matthew Brett wrote:

Hi,

Sorry to top-post!  But - I wanted to bring the discussion back to
licensing.  I have great sympathy for the ecological and code-quality
concerns, but licensing is a separate question, and, it seems to me,
an urgent question.


The licensing issue is complex and it is very likely that it will not 
get a definitive answer until a lawsuit centered around this issue is 
litigated in court. There are several lawsuits involving similar issues 
ongoing, but any resolution is likely to take several years.


By providing other, much more pragmatic and easier-to-gauge reasons to 
reject AI-generated contributions, I was trying to sidestep the 
licensing issue completely.


If there are other reasons why auto-generated contributions should be 
rejected, there is no need to solve the much harder problem of 
licensing: we don't want them regardless of the licensing issue.


Cheers,
Dan



[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Matthew Brett
Hi,

On Thu, Jul 4, 2024 at 3:41 PM Ralf Gommers  wrote:
>
>
>
> On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett  wrote:
>>
>> Hi,
>>
>> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers  wrote:
>> >
>> >
>> >
>> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett  
>> > wrote:
>> >>
>> >> Sorry - reposting from my subscribed address:
>> >>
>> >> Hi,
>> >>
>> >> Sorry to top-post!  But - I wanted to bring the discussion back to
>> >> licensing.  I have great sympathy for the ecological and code-quality
>> >> concerns, but licensing is a separate question, and, it seems to me,
>> >> an urgent question.
>> >>
>> >> Imagine I asked some AI to give me code to replicate a particular 
>> >> algorithm A.
>> >>
>> >> It is perfectly possible that the AI will largely or completely
>> >> reproduce some existing GPL code for A, from its training data.  There
>> >> is no way that I could know that the AI has done that without some
>> >> substantial research.  Surely, this is a license violation of the GPL
>> >> code?   Let's say we accept that code.  Others pick up the code and
>> >> modify it for other algorithms.  The code-base gets infected with GPL
>> >> code, in a way that will make it very difficult to disentangle.
>> >
>> >
>> > This is a question that's topical for all of open source, and usages of 
>> > CoPilot & co. We're not going to come to any insightful answer here that 
>> > is specific to NumPy. There's a ton of discussion in a lot of places; 
>> > someone needs to research/summarize that to move this forward. Debating it 
>> > from scratch here is unlikely to yield new arguments imho.
>>
>> Right - I wasn't expecting a detailed discussion on the merits - only
>> some thoughts on policy for now.
>>
>> > I agree with Rohit's: "it is probably hopeless to enforce a ban on AI 
>> > generated content". There are good ways to use AI code assistant tools and 
>> > bad ones; we in general cannot know whether AI tools were used at all by a 
>> > contributor (just like we can't know whether something was copied from 
>> > Stack Overflow), nor whether when it's done the content is derived enough 
>> > to fall under some other license. The best we can do here is add a warning 
>> > to the contributing docs and PR template about this, saying the 
>> > contributor needs to be the author so copied or AI-generated content needs 
>> > to not contain things that are complex enough to be copyrightable (none of 
>> > the linked PRs come close to this threshold).
>>
>> Yes, these PRs are not the concern - but I believe we do need to plan
>> now for the future.
>>
>> I agree it is hard to enforce, but it seems to me it would be a
>> reasonable defensive move to say - for now - that authors will need to
>> take full responsibility for copyright, and that, as of now,
>> AI-generated code cannot meet that standard, so we require authors to
>> turn off AI-generation when writing code for NumPy.
>
>
> I don't think that that is any more reasonable than asking contributors to 
> not look at Stack Overflow at all, or to not look at any other code base for 
> any reason. I bet many contributors may not even know whether the 
> auto-complete functionality in their IDE comes from a regular language server 
> (see https://langserver.org/) or an AI-enhanced one.
>
> I think the two options are:
> (A) do nothing yet, wait until the tools mature to the point where they can 
> actually do what you're worrying about here (at which point there may be more 
> insight/experience in the open source community about how to deal with the 
> problem).

Have we any reason to think that the tools are not doing this now?   I
ran one of my exercises through AI many months ago, and it found and
reproduced the publicly available solution, including the comments,
verbatim.

We do agree, enforcement is difficult - but I do not think AI
autogenerated code and looking at Stack Overflow are equivalent.  There
is no reasonable mechanism by which looking at Stack Overflow could
result in copy-paste of a substantial block of GPL'ed (or other
unsuitably licensed) code.  I do not think we have to be pure here,
just reassert: you have to own the copyright to the code, or point to
the license of the place you got it.  You can't do that if you've used
AI.  Don't use AI (to the extent you can prevent it).

Cheers,

Matthew


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread rgoswami

Personally, I wouldn't (as a maintainer) decide to reject code based on whether 
I feel it is generated by AI. It is much easier to rule on the quality of the 
contribution itself, and as noted, at least so far the AI-only contributions 
are very probably not going to clear the bar of being useful in their own right.

We are a large, visible project, but I think it is weirder to have a policy and 
fail to enforce it (e.g. don't use NumPy for military purposes, never use an AI 
tool, etc.) than to not have a specific policy at all.

--- Rohit

On 7/4/24 3:03 PM, Daniele Nicolodi  wrote:




[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Matthew Brett
Hi,

On Thu, Jul 4, 2024 at 4:11 PM  wrote:
>
> Personally, I wouldn't (as a maintainer) take a decision to reject code based 
> on if I feel it is generated by AI. It is much easier to rule on the quality 
> of the contribution itself, and as noted, at least so far the AI only 
> contributions are very probably not going to clear the barrier of being 
> useful in their own right.
>
> We are a large, visible project, but I think it is weirder to have a policy 
> and fail to enforce it (e.g. don't use NumPy for military purposes, never use 
> an AI tool, etc.) than to not have a specific policy at all.

It is not a meaningless requirement to ask contributors not to use AI,
even if it might be hard to detect.  Yes, sure, if we ask people not
to use it, sometimes they will because they don't care that we've
asked them not to, and sometimes they won't realize that they've used it.
But especially in the former case, we can, and I think should, reject
that code unless they can assure us that the code is free from any
incompatible copyright, a case they could make if they wanted to.

Cheers,

Matthew


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Rohit Goswami

> Personally, I wouldn't (as a maintainer)...

Especially since I know that many potential contributors may not have 
English as their first language, so stunted language / odd patterns are 
not **always** an AI indicator; sometimes it's just inexperience.


-- Rohit

On 7/4/24 3:03 PM, Daniele Nicolodi wrote:




[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Daniele Nicolodi

On 04/07/24 13:29, Matthew Brett wrote:

I agree it is hard to enforce, but it seems to me it would be a
reasonable defensive move to say - for now - that authors will need to
take full responsibility for copyright, and that, as of now,
AI-generated code cannot meet that standard, so we require authors to
turn off AI-generation when writing code for NumPy.


I like this position.

I wish it were common sense for contributors to an open source 
codebase that they need to own the copyright on their contributions, but 
I don't think it can be assumed. Adding something along these lines to the 
project policy also has the potential to educate contributors about 
the pitfalls of using AI to autocomplete their contributions.


Cheers,
Dan


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Matthew Brett
Hi,

On Thu, Jul 4, 2024 at 4:04 PM Daniele Nicolodi  wrote:
>
> On 04/07/24 12:49, Matthew Brett wrote:
> > Hi,
> >
> > Sorry to top-post!  But - I wanted to bring the discussion back to
> > licensing.  I have great sympathy for the ecological and code-quality
> > concerns, but licensing is a separate question, and, it seems to me,
> > an urgent question.
>
> The licensing issue is complex and it is very likely that it will not
> get a definitive answer until a lawsuit centered around this issue is
> litigated in court. There are several lawsuits involving similar issues
> ongoing, but any resolution is likely to take several years.

I feel sure we would want to avoid GPL code if the copyright holders
felt that we were abusing their license - regardless of whether the
court felt the copyright was realistically enforceable.

> Providing other, much more pragmatic and easier to gauge, reasons to
> reject AI generated contributions, I was trying to sidestep the
> licensing issue completely.
>
> If there are other reasons why auto-generated contributions should be
> rejected, there is no need to solve the much harder problem of
> licensing: we don't want them regardless of the licensing issue.

Let me take a different, but related, tack.   We don't, as yet, have a
good feeling for the societal harm that AI will do, or its benefits.
But I imagine we can agree that AI does lead to copyright ambiguity,
and, at the moment, it offers no compelling benefit over well-crafted
human-written code.

Then - let's be defensive while we consider the copyright problem, and
wait until AI can show some benefit that is a significant
counterweight to the copyright argument.

Cheers,

Matthew


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Rohit Goswami
Doesn't the project adopting wording of this kind "pass the buck" onto 
the maintainers? At the end of the day, failure to enforce our stated 
policy will be not only the responsibility of the authors but also of 
the reviewers / maintainers as a whole. In effect (and just speaking 
personally) wording like this would make it less likely for me as a 
maintainer to review PRs with questionable content (potentially 
penalizing new contributors who are human but not very aware of their 
tools, or who are junior), because I don't want to be implicated in 
attesting to the validity of possibly copyright-infringing code. I am of 
course not against the exact wording or the spirit, and the details are 
probably best hammered out on a PR if we decide it makes sense to try to 
catch AI-generated / assisted work (though I'm not sure we should).


Perhaps not very related, but at my Uni we recently decided it took more 
effort for us to try to make sure no one was using AI tools than to 
simply *grade*, and I think the same spirit applies here as well.


--- Rohit

On 7/4/24 3:18 PM, Daniele Nicolodi wrote:



[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Stefan van der Walt via NumPy-Discussion
On Thu, Jul 4, 2024, at 08:18, Daniele Nicolodi wrote:
> I wish it were common sense for contributors to an open source 
> codebase that they need to own the copyright on their contributions, but 
> I don't think it can be assumed. Adding something along these lines to the 
> project policy also has the potential to educate contributors about 
> the pitfalls of using AI to autocomplete their contributions.

The ultimate concern is whether GPL code lands in your open source project. 
Will instructions to the author, that they need to make sure they own copyright 
to their code, indemnify the project? I don't think so. You also cannot enforce 
such an instruction. At best, you can, during review, try and establish whether 
the author understands the code they provided; and I hope, where code of any 
complexity is involved, that that should be reasonably obvious.

You'll see we've grappled with this in scikit-image as well: 
https://github.com/scikit-image/scikit-image/pull/7429

Stéfan


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Matthew Brett
Hi,

On Thu, Jul 4, 2024 at 4:46 PM Rohit Goswami  wrote:
>
> Doesn't the project adopting wording of this kind "pass the buck" onto the 
> maintainers? At the end of the day, failure to enforce our stated policy will 
> be not only the responsibility of the authors but also of the reviewers / 
> maintainers as a whole. In effect (and just speaking personally) wording like 
> this would make it less likely for me as a maintainer to review PRs with 
> questionable content (potentially penalizing new contributors who are human 
> but not very aware of their tools, or who are junior), because I don't want 
> to be implicated in attesting to the validity of possibly 
> copyright-infringing code. I am of course not against the exact wording or 
> the spirit, and the details are probably best hammered out on a PR if we 
> decide it makes sense to try to catch AI-generated / assisted work (though 
> I'm not sure we should).
>
> Perhaps not very related, but at my Uni we recently decided it took more 
> effort for us to try to make sure no one was using AI tools than to simply 
> *grade*, and I think the same spirit applies here as well.

Well - I think it is related in a negative way.  In work for grading,
you have a necessarily adversarial relationship between the student
and the grader.   You can (in our university, we do) say "Don't use AI
generated material in your submission", and then it's our job to
detect if they have, when they submit.   However, for an open-source
project, the contributors are ourselves, and our collaborators.   We
are just asking for help in avoiding the risk of incorporating code
with incompatible copyright, and reminding contributors that this risk
is significant when using AI.   Our contributors do their best
(because they are our collaborators) and we do our best (because we
have the same goals).

Cheers,

Matthew


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Ralf Gommers
On Thu, Jul 4, 2024 at 5:08 PM Matthew Brett 
wrote:

> Hi,
>
> On Thu, Jul 4, 2024 at 3:41 PM Ralf Gommers 
> wrote:
> >
> >
> >
> > On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett 
> wrote:
> >>
> >> Hi,
> >>
> >> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers 
> wrote:
> >> >
> >> >
> >> >
> >> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett <
> matthew.br...@gmail.com> wrote:
> >> >>
> >> >> Sorry - reposting from my subscribed address:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Sorry to top-post!  But - I wanted to bring the discussion back to
> >> >> licensing.  I have great sympathy for the ecological and code-quality
> >> >> concerns, but licensing is a separate question, and, it seems to me,
> >> >> an urgent question.
> >> >>
> >> >> Imagine I asked some AI to give me code to replicate a particular
> algorithm A.
> >> >>
> >> >> It is perfectly possible that the AI will largely or completely
> >> >> reproduce some existing GPL code for A, from its training data.
> There
> >> >> is no way that I could know that the AI has done that without some
> >> >> substantial research.  Surely, this is a license violation of the GPL
> >> >> code?   Let's say we accept that code.  Others pick up the code and
> >> >> modify it for other algorithms.  The code-base gets infected with GPL
> >> >> code, in a way that will make it very difficult to disentangle.
> >> >
> >> >
> >> > This is a question that's topical for all of open source, and usages
> of CoPilot & co. We're not going to come to any insightful answer here that
> is specific to NumPy. There's a ton of discussion in a lot of places;
> someone needs to research/summarize that to move this forward. Debating it
> from scratch here is unlikely to yield new arguments imho.
> >>
> >> Right - I wasn't expecting a detailed discussion on the merits - only
> >> some thoughts on policy for now.
> >>
> >> > I agree with Rohit's: "it is probably hopeless to enforce a ban on AI
> generated content". There are good ways to use AI code assistant tools and
> bad ones; we in general cannot know whether AI tools were used at all by a
> contributor (just like we can't know whether something was copied from
> Stack Overflow), nor whether when it's done the content is derived enough
> to fall under some other license. The best we can do here is add a warning
> to the contributing docs and PR template about this, saying the contributor
> needs to be the author so copied or AI-generated content needs to not
> contain things that are complex enough to be copyrightable (none of the
> linked PRs come close to this threshold).
> >>
> >> Yes, these PRs are not the concern - but I believe we do need to plan
> >> now for the future.
> >>
> >> I agree it is hard to enforce, but it seems to me it would be a
> >> reasonable defensive move to say - for now - that authors will need to
> >> take full responsibility for copyright, and that, as of now,
> >> AI-generated code cannot meet that standard, so we require authors to
> >> turn off AI-generation when writing code for Numpy.
> >
> >
> > I don't think that that is any more reasonable than asking contributors
> to not look at Stack Overflow at all, or to not look at any other code base
> for any reason. I bet many contributors may not even know whether the
> auto-complete functionality in their IDE comes from a regular language
> server (see https://langserver.org/) or an AI-enhanced one.
> >
> > I think the two options are:
> > (A) do nothing yet, wait until the tools mature to the point where they
> can actually do what you're worrying about here (at which point there may
> be more insight/experience in the open source community about how to deal
> with the problem.
>
> Have we any reason to think that the tools are not doing this now?


Yes, namely that tools aren't capable yet of generating the type of code
that would land in NumPy. And if it's literal code from some other project
for the few things that are standard (e.g., C/C++ code for a sorting
algorithm), we'd anyway judge if it was authored by the PR submitter or not
(I've caught many issues like that with large PRs from new contributors,
e.g. translating from Matlab code directly).


>I ran one of my exercises through AI many months ago, and it found and
> reproduced the publicly available solution, including the comments,
> verbatim.
>

Not close to the same, not really a relevant data point.


> We do agree, enforcement is difficult - but I do not think AI
> autogenerated code and looking at StackOverflow are equivalent.  There
> is no reasonable mechanism by which looking at StackOverflow could
> result in copy-paste of a substantial block of GPL'ed (or other
> unsuitably licensed) code.  I do not think we have to be pure here,
> just reassert: you have to own the copyright to the code, or point to
> the license of the place you got it.  You can't do that if you've used
> AI.   Don't use AI (to the extent you can prevent it).
>

"don't use AI" is meaninglessly broad. You ignored 

[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Stefan Krah
On Thu, Jul 04, 2024 at 04:18:03PM +0100, Matthew Brett wrote:
> I feel sure we would want to avoid GPL code if the copyright holders
> felt that we were abusing their license - regardless of whether the
> court felt the copyright was realistically enforceable.

Apologies for probably stating the obvious, but BSD code also
requires attribution either in the code itself or in the docs.

I'm told that Bing Copilot often displays links to the origin
of the generated code, such as Stack Overflow.  So some tools do "know"
where the code came from and recognize the general copyright issue.


Stefan Krah



[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Stefan Krah


On Thu, Jul 04, 2024 at 03:46:02PM +, Rohit Goswami wrote:
> Doesn't the project adopting wording of this kind "pass the buck" onto the
> maintainers?

I think it depends.  NetBSD's AI policy mentions the responsibility of the
committers:

https://www.netbsd.org/developers/commit-guidelines.html


Gentoo mentions the contributors:

https://wiki.gentoo.org/wiki/Project:Council/AI_policy


The latter is quite common in CLAs in general.  Also, for a very long
time PyPI had a policy that put the entire responsibility for complying
with U.S. cryptography export regulations on the uploader, no matter
where the uploader was from.


I was told in no uncertain terms that this policy was just and that
it would protect the PSF (protection of uploaders was not a concern).


Stefan Krah




[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Matthew Brett
Hi,

On Thu, Jul 4, 2024 at 7:09 PM Stefan Krah  wrote:
>
> On Thu, Jul 04, 2024 at 04:18:03PM +0100, Matthew Brett wrote:
> > I feel sure we would want to avoid GPL code if the copyright holders
> > felt that we were abusing their license - regardless of whether the
> > court felt the copyright was realistically enforceable.
>
> Apologies for probably stating the obvious, but BSD code also
> requires attribution either in the code itself or in the docs.
>
> I'm told that Bing Copilot often displays links to the origin
> of the generated code like Stackoverflow.  So some tools do "know"
> where the code came from and recognize the general copyright issue.

Yes, I think it would be totally fine to say - AI is a risk for
including copyrighted code; as far as possible, please do not use it
**.   There may be some cases where AI-generated code has clear
copyright, in which case, if you do want to submit such code, please
submit it, say which AI you used, and give the copyright attribution.

Cheers,

Matthew

** See my forthcoming reply to Ralf - summary - don't use it for
anything that generates more than a line of code.


[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Matthew Brett
Hi,

On Thu, Jul 4, 2024 at 6:44 PM Ralf Gommers  wrote:
>
>
>
> On Thu, Jul 4, 2024 at 5:08 PM Matthew Brett  wrote:
>>
>> Hi,
>>
>> On Thu, Jul 4, 2024 at 3:41 PM Ralf Gommers  wrote:
>> >
>> >
>> >
>> > On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett  
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers  
>> >> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett  
>> >> > wrote:
>> >> >>
>> >> >> Sorry - reposting from my subscribed address:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Sorry to top-post!  But - I wanted to bring the discussion back to
>> >> >> licensing.  I have great sympathy for the ecological and code-quality
>> >> >> concerns, but licensing is a separate question, and, it seems to me,
>> >> >> an urgent question.
>> >> >>
>> >> >> Imagine I asked some AI to give me code to replicate a particular 
>> >> >> algorithm A.
>> >> >>
>> >> >> It is perfectly possible that the AI will largely or completely
>> >> >> reproduce some existing GPL code for A, from its training data.  There
>> >> >> is no way that I could know that the AI has done that without some
>> >> >> substantial research.  Surely, this is a license violation of the GPL
>> >> >> code?   Let's say we accept that code.  Others pick up the code and
>> >> >> modify it for other algorithms.  The code-base gets infected with GPL
>> >> >> code, in a way that will make it very difficult to disentangle.
>> >> >
>> >> >
>> >> > This is a question that's topical for all of open source, and usages of 
>> >> > CoPilot & co. We're not going to come to any insightful answer here 
>> >> > that is specific to NumPy. There's a ton of discussion in a lot of 
>> >> > places; someone needs to research/summarize that to move this forward. 
>> >> > Debating it from scratch here is unlikely to yield new arguments imho.
>> >>
>> >> Right - I wasn't expecting a detailed discussion on the merits - only
>> >> some thoughts on policy for now.
>> >>
>> >> > I agree with Rohit's: "it is probably hopeless to enforce a ban on AI 
>> >> > generated content". There are good ways to use AI code assistant tools 
>> >> > and bad ones; we in general cannot know whether AI tools were used at 
>> >> > all by a contributor (just like we can't know whether something was 
>> >> > copied from Stack Overflow), nor whether when it's done the content is 
>> >> > derived enough to fall under some other license. The best we can do 
>> >> > here is add a warning to the contributing docs and PR template about 
>> >> > this, saying that the contributor needs to be the author, so any copied 
>> >> > or AI-generated content must not contain anything complex enough to be 
>> >> > copyrightable (none of the linked PRs come close to this threshold).
>> >>
>> >> Yes, these PRs are not the concern - but I believe we do need to plan
>> >> now for the future.
>> >>
>> >> I agree it is hard to enforce, but it seems to me it would be a
>> >> reasonable defensive move to say - for now - that authors will need to
>> >> take full responsibility for copyright, and that, as of now,
>> >> AI-generated code cannot meet that standard, so we require authors to
>> >> turn off AI generation when writing code for NumPy.
>> >
>> >
>> > I don't think that that is any more reasonable than asking contributors to 
>> > not look at Stack Overflow at all, or to not look at any other code base 
>> > for any reason. I bet many contributors may not even know whether the 
>> > auto-complete functionality in their IDE comes from a regular language 
>> > server (see https://langserver.org/) or an AI-enhanced one.
>> >
>> > I think the two options are:
>> > (A) do nothing yet, wait until the tools mature to the point where they 
>> > can actually do what you're worrying about here (at which point there may 
>> > be more insight/experience in the open source community about how to deal 
>> > with the problem).
>>
>> Have we any reason to think that the tools are not doing this now?
>
>
> Yes, namely that tools aren't capable yet of generating the type of code that 
> would land in NumPy. And if it's literal code from some other project for the 
> few things that are standard (e.g., C/C++ code for a sorting algorithm), we'd 
> judge anyway whether it was authored by the PR submitter or not (I've caught many 
> issues like that with large PRs from new contributors, e.g. translating from 
> Matlab code directly).
>
>>
>>I ran one of my exercises through AI many months ago, and it found and
>> reproduced the publicly available solution, including the comments,
>> verbatim.
>
>
> Not close to the same, not really a relevant data point.

The question I was trying to address was - do we have any reason to
think that current AI will not reproduce publicly-available code
verbatim? I don't think we do, and my example showed AI
doing just that.

>> We do agree, enforcement is difficult - but I do not think AI
>> autogenerated code and looking at

[Numpy-discussion] Re: Policy on AI-generated code

2024-07-04 Thread Ralf Gommers
On Thu, Jul 4, 2024 at 8:42 PM Matthew Brett wrote:

> Hi,
>
> The question I was trying to address was - do we have any reason to
> think that current AI will not reproduce publicly-available code
> verbatim.   I don't think we do, and the example was an example of AI
> doing just