[Numpy-discussion] Re: Policy on AI-generated code
From a quick look, it seems like some of these (the masked array ones) are trivial enough to not warrant inclusion, and the ctypes snippet is obvious enough that copyright claims won't be an issue.

In terms of broader policy I don't really have much to say, except that in general it is probably hopeless to enforce a ban on AI generated content.

--- Rohit

On 7/4/24 4:59 AM, Loïc Estève via NumPy-Discussion wrote:

Hi,

in scikit-learn, more of a FYI than some kind of policy (amongst other things it does not even mention explicitly "AI" and avoids the licence discussion), we recently added a note in our FAQ about "fully automated tools":

https://github.com/scikit-learn/scikit-learn/pull/29287

From my personal experience in scikit-learn, I am very skeptical about the quality of this kind of contributions so far ... but you know future may well prove me very wrong.

Cheers,
Loïc

> Hi,
>
> We recently got a set of well-labeled PRs containing (reviewed)
> AI-generated code:
>
> https://github.com/numpy/numpy/pull/26827
> https://github.com/numpy/numpy/pull/26828
> https://github.com/numpy/numpy/pull/26829
> https://github.com/numpy/numpy/pull/26830
> https://github.com/numpy/numpy/pull/26831
>
> Do we have a policy on AI-generated code? It seems to me that
> AI-code in general must be a license risk, as the AI may well generate
> code that was derived from, for example, code with a GPL-license.
>
> Cheers,
>
> Matthew
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: loic.est...@ymail.com
[Numpy-discussion] Re: Policy on AI-generated code
On 03/07/24 23:40, Matthew Brett wrote:
> Hi,
>
> We recently got a set of well-labeled PRs containing (reviewed)
> AI-generated code:
>
> https://github.com/numpy/numpy/pull/26827
> https://github.com/numpy/numpy/pull/26828
> https://github.com/numpy/numpy/pull/26829
> https://github.com/numpy/numpy/pull/26830
> https://github.com/numpy/numpy/pull/26831
>
> Do we have a policy on AI-generated code? It seems to me that
> AI-code in general must be a license risk, as the AI may well generate
> code that was derived from, for example, code with a GPL-license.

There is definitely the issue of copyright to keep in mind, but I see two other issues: the quality of the contributions and one moral issue.

IMHO the PRs linked above are not high quality contributions: for example, the added examples are often redundant with each other. In my experience these are representative of automatically generated content: as there is little to no effort involved in writing it, the content is often repetitive and has very low information density. In the case of documentation, I find this very detrimental to the overall quality.

Contributions generated with AI have huge ecological and social costs. Encouraging AI generated contributions, especially where there is absolutely no need to involve AI to get to the solution, as in the examples above, makes the project co-responsible for these costs.

Cheers,
Dan
[Numpy-discussion] Re: Policy on AI-generated code
Hi All,

I agree with Dan that the actual contributions to the documentation are of little value: it is not easy to write good documentation, with examples that show not just the mechanics but the purpose of the function, i.e., go well beyond just showing some random inputs and outputs. And poorly constructed examples are detrimental in that they just hide the fact that the documentation is bad.

I also second his worries about ecological and social costs.

But let me add a third issue: the costs to maintainers. I had a quick glance at some of those PRs when they were first posted, but basically decided they were not worth my time to review. For a human contributor, I might well have decided differently, since helping someone to improve their contribution often leads to higher quality further contributions. But here there seems to be no such hope.

All the best,

Marten
[Numpy-discussion] Re: Policy on AI-generated code
Hi,

Sorry to top-post! But - I wanted to bring the discussion back to licensing. I have great sympathy for the ecological and code-quality concerns, but licensing is a separate question, and, it seems to me, an urgent question.

Imagine I asked some AI to give me code to replicate a particular algorithm A.

It is perfectly possible that the AI will largely or completely reproduce some existing GPL code for A, from its training data. There is no way that I could know that the AI has done that without some substantial research. Surely, this is a license violation of the GPL code? Let's say we accept that code. Others pick up the code and modify it for other algorithms. The code-base gets infected with GPL code, in a way that will make it very difficult to disentangle.

Have we consulted a copyright lawyer on this? Specifically, have we consulted someone who advocates the GPL?

Cheers,

Matthew
[Numpy-discussion] Re: Policy on AI-generated code
On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett wrote:

> Imagine I asked some AI to give me code to replicate a particular
> algorithm A.
>
> It is perfectly possible that the AI will largely or completely
> reproduce some existing GPL code for A, from its training data. There
> is no way that I could know that the AI has done that without some
> substantial research. Surely, this is a license violation of the GPL
> code? Let's say we accept that code. Others pick up the code and
> modify it for other algorithms. The code-base gets infected with GPL
> code, in a way that will make it very difficult to disentangle.

This is a question that's topical for all of open source, and usages of CoPilot & co. We're not going to come to any insightful answer here that is specific to NumPy. There's a ton of discussion in a lot of places; someone needs to research/summarize that to move this forward. Debating it from scratch here is unlikely to yield new arguments imho.

I agree with Rohit's: "it is probably hopeless to enforce a ban on AI generated content". There are good ways to use AI code assistant tools and bad ones; we in general cannot know whether AI tools were used at all by a contributor (just like we can't know whether something was copied from Stack Overflow), nor whether, when it's done, the content is derived enough to fall under some other license. The best we can do here is add a warning to the contributing docs and PR template about this, saying the contributor needs to be the author, so copied or AI-generated content needs to not contain things that are complex enough to be copyrightable (none of the linked PRs come close to this threshold).

> Have we consulted a copyright lawyer on this? Specifically, have we
> consulted someone who advocates the GPL?

Not that I know of.

Cheers,
Ralf
[Numpy-discussion] Re: Policy on AI-generated code
Hi,

On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers wrote:

> This is a question that's topical for all of open source, and usages of
> CoPilot & co. We're not going to come to any insightful answer here that is
> specific to NumPy. There's a ton of discussion in a lot of places; someone
> needs to research/summarize that to move this forward. Debating it from
> scratch here is unlikely to yield new arguments imho.

Right - I wasn't expecting a detailed discussion on the merits - only some thoughts on policy for now.

> I agree with Rohit's: "it is probably hopeless to enforce a ban on AI
> generated content". There are good ways to use AI code assistant tools and
> bad ones; we in general cannot know whether AI tools were used at all by a
> contributor (just like we can't know whether something was copied from Stack
> Overflow), nor whether when it's done the content is derived enough to fall
> under some other license. The best we can do here is add a warning to the
> contributing docs and PR template about this, saying the contributor needs to
> be the author so copied or AI-generated content needs to not contain things
> that are complex enough to be copyrightable (none of the linked PRs come
> close to this threshold).

Yes, these PRs are not the concern - but I believe we do need to plan now for the future.

I agree it is hard to enforce, but it seems to me it would be a reasonable defensive move to say - for now - that authors will need to take full responsibility for copyright, and that, as of now, AI-generated code cannot meet that standard, so we require authors to turn off AI-generation when writing code for Numpy.

Cheers,

Matthew
[Numpy-discussion] Re: Policy on AI-generated code
> It is perfectly possible that the AI will largely or completely reproduce
> some existing GPL code for A, from its training data. There is no way that I
> could know that the AI has done that without some substantial research.

Even if it did, what if the common code were arrived at independently, e.g. it wasn't used in training?

Searching by approximate text match seems to cover similarity, maybe requiring a legal standard for this purpose. Aside from ML, the method I'm familiar with involves cosine similarity on an N-dimensional vector representing counts of, say, all 5-char sequences in the text, where N becomes ~26^5. Any licensed code would be fingerprinted and checked for license status before being added to an official database.

Bill

--
Phobrain.com
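
As a rough illustration of the fingerprinting idea Bill describes, the sketch below counts overlapping 5-character sequences in two texts and takes the cosine similarity of the resulting count vectors. It is only a sketch under assumptions: the function names (`ngram_counts`, `cosine_similarity`, `similarity`) are hypothetical, the vectors are stored sparsely as counters rather than as dense ~26^5-dimensional arrays, and no normalization of whitespace or identifiers is attempted.

```python
import math
from collections import Counter


def ngram_counts(text: str, n: int = 5) -> Counter:
    """Count every overlapping n-character sequence in `text`."""
    # A text shorter than n characters yields an empty counter.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    # Only n-grams present in both texts contribute to the dot product.
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # at least one text has no n-grams at all
    return dot / (norm_a * norm_b)


def similarity(text_a: str, text_b: str, n: int = 5) -> float:
    """Similarity in [0, 1] between two texts based on n-gram counts."""
    return cosine_similarity(ngram_counts(text_a, n), ngram_counts(text_b, n))
```

A score near 1 means the two texts share most of their 5-character sequences; a score near 0 means they share few or none. The sketch says nothing about what score would indicate a derived work, which is the legal question the thread leaves open.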
[Numpy-discussion] Re: Policy on AI-generated code
On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett wrote:

> Yes, these PRs are not the concern - but I believe we do need to plan
> now for the future.
>
> I agree it is hard to enforce, but it seems to me it would be a
> reasonable defensive move to say - for now - that authors will need to
> take full responsibility for copyright, and that, as of now,
> AI-generated code cannot meet that standard, so we require authors to
> turn off AI-generation when writing code for Numpy.

I don't think that is any more reasonable than asking contributors to not look at Stack Overflow at all, or to not look at any other code base for any reason. I bet many contributors may not even know whether the auto-complete functionality in their IDE comes from a regular language server (see https://langserver.org/) or an AI-enhanced one.

I think the two options are:

(A) do nothing yet, wait until the tools mature to the point where they can actually do what you're worrying about here (at which point there may be more insight/experience in the open source community about how to deal with the problem).

(B) add a note along the lines I suggested as an option above ("... not contain things that are complex enough to be copyrightable ...")

Cheers,
Ralf
[Numpy-discussion] Re: Policy on AI-generated code
On 04/07/24 12:49, Matthew Brett wrote:
> Hi,
>
> Sorry to top-post! But - I wanted to bring the discussion back to
> licensing. I have great sympathy for the ecological and code-quality
> concerns, but licensing is a separate question, and, it seems to me,
> an urgent question.

The licensing issue is complex and it is very likely that it will not get a definitive answer until a lawsuit centered around this issue is litigated in court. There are several lawsuits involving similar issues ongoing, but any resolution is likely to take several years.

By providing other, much more pragmatic and easier to gauge, reasons to reject AI generated contributions, I was trying to sidestep the licensing issue completely. If there are other reasons why auto-generated contributions should be rejected, there is no need to solve the much harder problem of licensing: we don't want them regardless of the licensing issue.

Cheers,
Dan
[Numpy-discussion] Re: Policy on AI-generated code
Hi, On Thu, Jul 4, 2024 at 3:41 PM Ralf Gommers wrote: > > > > On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett wrote: >> >> Hi, >> >> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers wrote: >> > >> > >> > >> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett >> > wrote: >> >> >> >> Sorry - reposting from my subscribed address: >> >> >> >> Hi, >> >> >> >> Sorry to top-post! But - I wanted to bring the discussion back to >> >> licensing. I have great sympathy for the ecological and code-quality >> >> concerns, but licensing is a separate question, and, it seems to me, >> >> an urgent question. >> >> >> >> Imagine I asked some AI to give me code to replicate a particular >> >> algorithm A. >> >> >> >> It is perfectly possible that the AI will largely or completely >> >> reproduce some existing GPL code for A, from its training data. There >> >> is no way that I could know that the AI has done that without some >> >> substantial research. Surely, this is a license violation of the GPL >> >> code? Let's say we accept that code. Others pick up the code and >> >> modify it for other algorithms. The code-base gets infected with GPL >> >> code, in a way that will make it very difficult to disentangle. >> > >> > >> > This is a question that's topical for all of open source, and usages of >> > CoPilot & co. We're not going to come to any insightful answer here that >> > is specific to NumPy. There's a ton of discussion in a lot of places; >> > someone needs to research/summarize that to move this forward. Debating it >> > from scratch here is unlikely to yield new arguments imho. >> >> Right - I wasn't expecting a detailed discussion on the merits - only >> some thoughts on policy for now. >> >> > I agree with Rohit's: "it is probably hopeless to enforce a ban on AI >> > generated content". 
There are good ways to use AI code assistant tools and >> > bad ones; we in general cannot know whether AI tools were used at all by a >> > contributor (just like we can't know whether something was copied from >> > Stack Overflow), nor whether when it's done the content is derived enough >> > to fall under some other license. The best we can do here is add a warning >> > to the contributing docs and PR template about this, saying the >> > contributor needs to be the author so copied or AI-generated content needs >> > to not contain things that are complex enough to be copyrightable (none of >> > the linked PRs come close to this threshold). >> >> Yes, these PRs are not the concern - but I believe we do need to plan >> now for the future. >> >> I agree it is hard to enforce, but it seems to me it would be a >> reasonable defensive move to say - for now - that authors will need to >> take full responsibility for copyright, and that, as of now, >> AI-generated code cannot meet that standard, so we require authors to >> turn off AI-generation when writing code for Numpy. > > > I don't think that that is any more reasonable than asking contributors to > not look at Stack Overflow at all, or to not look at any other code base for > any reason. I bet many contributors may not even know whether the > auto-complete functionality in their IDE comes from a regular language server > (see https://langserver.org/) or an AI-enhanced one. > > I think the two options are: > (A) do nothing yet, wait until the tools mature to the point where they can > actually do what you're worrying about here (at which point there may be more > insight/experience in the open source community about how to deal with the > problem. Have we any reason to think that the tools are not doing this now? I ran one of my exercises through AI many months ago, and it found and reproduced the publicly available solution, including the comments, verbatim. 
We do agree, enforcement is difficult - but I do not think AI autogenerated code and looking at StackOverflow are equivalent. There is no reasonable mechanism by which looking at StackOverflow could result in copy-paste of a substantial block of GPL'ed (or other unsuitably licensed) code. I do not think we have to be pure here, just reassert - you have to own the copyright to the code, or point to the license of the place you got it. You can't do that if you've used AI. Don't use AI (to the extent you can prevent it). Cheers, Matthew
[Numpy-discussion] Re: Policy on AI-generated code
Personally, I wouldn't (as a maintainer) take a decision to reject code based on whether I feel it is generated by AI. It is much easier to rule on the quality of the contribution itself, and as noted, at least so far the AI-only contributions are very probably not going to clear the barrier of being useful in their own right. We are a large, visible project, but I think it is weirder to have a policy and fail to enforce it (e.g. don't use NumPy for military purposes, never use an AI tool, etc.) than to not have a specific policy at all. --- Rohit On 7/4/24 3:03 PM, Daniele Nicolodi wrote: On 04/07/24 12:49, Matthew Brett wrote: > Hi, > > Sorry to top-post! But - I wanted to bring the discussion back to > licensing. I have great sympathy for the ecological and code-quality > concerns, but licensing is a separate question, and, it seems to me, > an urgent question. The licensing issue is complex and it is very likely that it will not get a definitive answer until a lawsuit centered around this issue is litigated in court. There are several lawsuits involving similar issues ongoing, but any resolution is likely to take several years. Providing other, much more pragmatic and easier to gauge, reasons to reject AI generated contributions, I was trying to sidestep the licensing issue completely. If there are other reasons why auto-generated contributions should be rejected, there is no need to solve the much harder problem of licensing: we don't want them regardless of the licensing issue. Cheers, Dan
[Numpy-discussion] Re: Policy on AI-generated code
Hi, On Thu, Jul 4, 2024 at 4:11 PM wrote: > > Personally, I wouldn't (as a maintainer) take a decision to reject code based > on if I feel it is generated by AI. It is much easier to rule on the quality > of the contribution itself, and as noted, at least so far the AI only > contributions are very probably not going to clear the barrier of being > useful in their own right. > > We are a large, visible project, but I think it is weirder to have a policy > and fail to enforce it (e.g. don't use NumPy for military purposes, never us > an AI tool etc). than to not have a specific policy at all. It is not a meaningless requirement to ask contributors not to use AI, even if it might be hard to detect. Yes, sure, if we ask people not to use it, sometimes they will because they don't care that we've asked them not to, sometimes they won't realize that they've used it. But especially in the former case, we can, and I think should, reject that code unless they can assure us that the code is free from any incompatible copyright. Which case they could make, if they wanted to. Cheers, Matthew
[Numpy-discussion] Re: Policy on AI-generated code
> Personally, I wouldn't (as a maintainer)... Especially since I know that many potential contributors may not have English as their first language, so stunted language / odd patterns are not **always** an AI indicator; sometimes it's just inexperience. -- Rohit On 7/4/24 3:03 PM, Daniele Nicolodi wrote: On 04/07/24 12:49, Matthew Brett wrote: Hi, Sorry to top-post! But - I wanted to bring the discussion back to licensing. I have great sympathy for the ecological and code-quality concerns, but licensing is a separate question, and, it seems to me, an urgent question. The licensing issue is complex and it is very likely that it will not get a definitive answer until a lawsuit centered around this issue is litigated in court. There are several lawsuits involving similar issues ongoing, but any resolution is likely to take several years. Providing other, much more pragmatic and easier to gauge, reasons to reject AI generated contributions, I was trying to sidestep the licensing issue completely. If there are other reasons why auto-generated contributions should be rejected, there is no need to solve the much harder problem of licensing: we don't want them regardless of the licensing issue. Cheers, Dan
[Numpy-discussion] Re: Policy on AI-generated code
On 04/07/24 13:29, Matthew Brett wrote: I agree it is hard to enforce, but it seems to me it would be a reasonable defensive move to say - for now - that authors will need to take full responsibility for copyright, and that, as of now, AI-generated code cannot meet that standard, so we require authors to turn off AI-generation when writing code for Numpy. I like this position. I wish it were common sense for contributors to an open source codebase that they need to own the copyright on their contributions, but I don't think it can be assumed. Adding something along these lines to the project policy also has the potential to educate contributors about the pitfalls of using AI to autocomplete their contributions. Cheers, Dan
[Numpy-discussion] Re: Policy on AI-generated code
Hi, On Thu, Jul 4, 2024 at 4:04 PM Daniele Nicolodi wrote: > > On 04/07/24 12:49, Matthew Brett wrote: > > Hi, > > > > Sorry to top-post! But - I wanted to bring the discussion back to > > licensing. I have great sympathy for the ecological and code-quality > > concerns, but licensing is a separate question, and, it seems to me, > > an urgent question. > > The licensing issue is complex and it is very likely that it will not > get a definitive answer until a lawsuit centered around this issue is > litigated in court. There are several lawsuits involving similar issues > ongoing, but any resolution is likely to take several years. I feel sure we would want to avoid GPL code if the copyright holders felt that we were abusing their license - regardless of whether the court felt the copyright was realistically enforceable. > Providing other, much more pragmatic and easier to gauge, reasons to > reject AI generated contributions, I was trying to sidestep the > licensing issue completely. > > If there are other reasons why auto-generated contributions should be > rejected, there is no need to solve the much harder problem of > licensing: we don't want them regardless of the licensing issue. Let me take a different tack, but related. We don't, as yet, have a good feeling for the societal harm that AI will do, or its benefits. But I imagine we can agree that AI does lead to copyright ambiguity, and, at the moment, it offers no compelling benefit over well-crafted human-written code. Then - let's be defensive while we consider the copyright problem, and wait until AI can show some benefit that is a significant counterweight to the copyright argument. Cheers, Matthew
[Numpy-discussion] Re: Policy on AI-generated code
Doesn't the project adopting wording of this kind "pass the buck" onto the maintainers? At the end of the day, failure to enforce our stated policy will be not only the responsibility of the authors but also of the reviewers / maintainers as a whole. In effect (and just speaking personally) wording like this would make it less likely for me as a maintainer to review PRs with questionable content (potentially avoiding new contributors who are human but not very aware of their tools / are junior / new contributors), because I don't want to be implicated in attesting to the validity of possibly copyright-infringing code. I am of course not against the exact wording or the spirit, and the details are probably best hammered out on a PR if we decide it makes sense to try to catch AI generated / assisted work (though I'm not sure we should). Perhaps not very related, but at my Uni we recently decided it took far more effort to try to make sure no one was using AI tools than to simply *grade*, and I think the same spirit applies here as well. --- Rohit On 7/4/24 3:18 PM, Daniele Nicolodi wrote: On 04/07/24 13:29, Matthew Brett wrote: I agree it is hard to enforce, but it seems to me it would be a reasonable defensive move to say - for now - that authors will need to take full responsibility for copyright, and that, as of now, AI-generated code cannot meet that standard, so we require authors to turn off AI-generation when writing code for Numpy. I like this position. I wish it for be common sense for contributors to an open source codebase that they need to own the copyright on their contributions, but I don't think it can be assumed. Adding something to these lines to the project policy has also the potential to educate the contributions about the pitfalls of using AI to autocomplete their contributions.
Cheers, Dan
[Numpy-discussion] Re: Policy on AI-generated code
On Thu, Jul 4, 2024, at 08:18, Daniele Nicolodi wrote: > I wish it for be common sense for contributors to an open source > codebase that they need to own the copyright on their contributions, but > I don't think it can be assumed. Adding something to these lines to the > project policy has also the potential to educate the contributions about > the pitfalls of using AI to autocomplete their contributions. The ultimate concern is whether GPL code lands in your open source project. Will instructions to the author, that they need to make sure they own copyright to their code, indemnify the project? I don't think so. You also cannot enforce such an instruction. At best, you can, during review, try and establish whether the author understands the code they provided; and I hope, where code of any complexity is involved, that that should be reasonably obvious. You'll see we've grappled with this in scikit-image as well: https://github.com/scikit-image/scikit-image/pull/7429 Stéfan
[Numpy-discussion] Re: Policy on AI-generated code
Hi, On Thu, Jul 4, 2024 at 4:46 PM Rohit Goswami wrote: > > Doesn't the project adopting wording of this kind "pass the buck" onto the > maintainers? At the end of the day, failure to enforce our stated policy will > be not only the responsibility of the authors but also the reviewers / > maintainers on whole. In effect (and just speaking personally) wording like > this would make it less likely for me as a maintainer to review PRs with > questionable content (potentially avoiding new contributors who are human but > not very aware of their tools / are junior / new contributors), because I > don't want to be implicated in attesting to the validity of possibly > copyright infringing code. I am of course not against the exact wording or > the spirit, and the details are probably best hammered out on a PR if we > decide it makes sense to try to catch AI generated / assisted work (though > I'm not sure we should).. > > Perhaps not very related, but at my Uni we recently decided it took too much > effort for us to try to make sure no one was using AI tools than to simply > *grade*, and I think the same spirit applies here as well. Well - I think it is related in a negative way. In work for grading, you have a necessarily adversarial relationship between the student and the grader. You can (in our university, we do) say "Don't use AI generated material in your submission", and then it's our job to detect if they have, when they submit. However, for an open-source project, the contributors are ourselves, and our collaborators. We are just asking for help in avoiding the risk of incorporating code with incompatible copyright, and reminding contributors that this risk is significant when using AI. Our contributors do their best (because they are our collaborators) and we do our best (because we have the same goals).
Cheers, Matthew
[Numpy-discussion] Re: Policy on AI-generated code
On Thu, Jul 4, 2024 at 5:08 PM Matthew Brett wrote: > Hi, > > On Thu, Jul 4, 2024 at 3:41 PM Ralf Gommers > wrote: > > > > > > > > On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett > wrote: > >> > >> Hi, > >> > >> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers > wrote: > >> > > >> > > >> > > >> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett < > matthew.br...@gmail.com> wrote: > >> >> > >> >> Sorry - reposting from my subscribed address: > >> >> > >> >> Hi, > >> >> > >> >> Sorry to top-post! But - I wanted to bring the discussion back to > >> >> licensing. I have great sympathy for the ecological and code-quality > >> >> concerns, but licensing is a separate question, and, it seems to me, > >> >> an urgent question. > >> >> > >> >> Imagine I asked some AI to give me code to replicate a particular > algorithm A. > >> >> > >> >> It is perfectly possible that the AI will largely or completely > >> >> reproduce some existing GPL code for A, from its training data. > There > >> >> is no way that I could know that the AI has done that without some > >> >> substantial research. Surely, this is a license violation of the GPL > >> >> code? Let's say we accept that code. Others pick up the code and > >> >> modify it for other algorithms. The code-base gets infected with GPL > >> >> code, in a way that will make it very difficult to disentangle. > >> > > >> > > >> > This is a question that's topical for all of open source, and usages > of CoPilot & co. We're not going to come to any insightful answer here that > is specific to NumPy. There's a ton of discussion in a lot of places; > someone needs to research/summarize that to move this forward. Debating it > from scratch here is unlikely to yield new arguments imho. > >> > >> Right - I wasn't expecting a detailed discussion on the merits - only > >> some thoughts on policy for now. > >> > >> > I agree with Rohit's: "it is probably hopeless to enforce a ban on AI > generated content". 
There are good ways to use AI code assistant tools and > bad ones; we in general cannot know whether AI tools were used at all by a > contributor (just like we can't know whether something was copied from > Stack Overflow), nor whether when it's done the content is derived enough > to fall under some other license. The best we can do here is add a warning > to the contributing docs and PR template about this, saying the contributor > needs to be the author so copied or AI-generated content needs to not > contain things that are complex enough to be copyrightable (none of the > linked PRs come close to this threshold). > >> > >> Yes, these PRs are not the concern - but I believe we do need to plan > >> now for the future. > >> > >> I agree it is hard to enforce, but it seems to me it would be a > >> reasonable defensive move to say - for now - that authors will need to > >> take full responsibility for copyright, and that, as of now, > >> AI-generated code cannot meet that standard, so we require authors to > >> turn off AI-generation when writing code for Numpy. > > > > > > I don't think that that is any more reasonable than asking contributors > to not look at Stack Overflow at all, or to not look at any other code base > for any reason. I bet many contributors may not even know whether the > auto-complete functionality in their IDE comes from a regular language > server (see https://langserver.org/) or an AI-enhanced one. > > > > I think the two options are: > > (A) do nothing yet, wait until the tools mature to the point where they > can actually do what you're worrying about here (at which point there may > be more insight/experience in the open source community about how to deal > with the problem. > > Have we any reason to think that the tools are not doing this now? Yes, namely that tools aren't capable yet of generating the type of code that would land in NumPy. 
And if it's literal code from some other project for the few things that are standard (e.g., C/C++ code for a sorting algorithm), we'd anyway judge if it was authored by the PR submitter or not (I've caught many issues like that with large PRs from new contributors, e.g. translating from Matlab code directly). >I ran one of my exercises through AI many months ago, and it found and > reproduced the publicly available solution, including the comments, > verbatim. > Not close to the same, not really a relevant data point. > We do agree, enforcement is difficult - but I do not think AI > autogenerated code and looking at StackOverflow are equivalent. There > is no reasonable mechanism by which looking at StackOverflow could > result in copy-paste of a substantial block of GPL'ed (or other > unsuitably licensed) code.I do not think we have to be pure here, > just reassert - you have to own the copyright to the code, or point > the license of the place you got it. You can't do that if you've used > AI. Don't use AI (to the extent you can prevent it). > "don't use AI" is meaninglessly broad. You ignored
[Numpy-discussion] Re: Policy on AI-generated code
On Thu, Jul 04, 2024 at 04:18:03PM +0100, Matthew Brett wrote: > I feel sure we would want to avoid GPL code if the copyright holders > felt that we were abusing their license - regardless of whether the > court felt the copyright was realistically enforceable. Apologies for probably stating the obvious, but BSD code also requires attribution either in the code itself or in the docs. I'm told that Bing Copilot often displays links to the origin of the generated code like Stackoverflow. So some tools do "know" where the code came from and recognize the general copyright issue. Stefan Krah
[Numpy-discussion] Re: Policy on AI-generated code
On Thu, Jul 04, 2024 at 03:46:02PM +, Rohit Goswami wrote: > Doesn't the project adopting wording of this kind "pass the buck" onto the > maintainers? I think it depends. NetBSD's AI policy mentions the responsibility of the committers: https://www.netbsd.org/developers/commit-guidelines.html Gentoo mentions the contributors: https://wiki.gentoo.org/wiki/Project:Council/AI_policy The latter is quite common in CLAs in general. Also, for a very long time PyPI had a policy that put the entire responsibility for complying with U.S. cryptography export regulations on the uploader, no matter where the uploader was from. I was told in no uncertain terms that this policy was just and that it would protect the PSF (protection of uploaders was not a concern). Stefan Krah
[Numpy-discussion] Re: Policy on AI-generated code
Hi, On Thu, Jul 4, 2024 at 7:09 PM Stefan Krah wrote: > > On Thu, Jul 04, 2024 at 04:18:03PM +0100, Matthew Brett wrote: > > I feel sure we would want to avoid GPL code if the copyright holders > > felt that we were abusing their license - regardless of whether the > > court felt the copyright was realistically enforceable. > > Apologies for probably stating the obvious, but BSD code also > requires attribution either in the code itself or in the docs. > > I'm told that Bing Copilot often displays links to the origin > of the generated code like Stackoverflow. So some tools do "know" > where the code came from and recognize the general copyright issue. Yes, I think it would be totally fine to say - AI is a risk for including copyrighted code; as far as possible, please do not use it **. There may be some cases where AI-generated code has clear copyright, in which case, if you do want to submit such code, please submit it, say what AI you used, and the copyright attribution. Cheers, Matthew ** See my forthcoming reply to Ralf - summary - don't use it for anything that generates more than a line of code.
[Numpy-discussion] Re: Policy on AI-generated code
Hi, On Thu, Jul 4, 2024 at 6:44 PM Ralf Gommers wrote: > > > > On Thu, Jul 4, 2024 at 5:08 PM Matthew Brett wrote: >> >> Hi, >> >> On Thu, Jul 4, 2024 at 3:41 PM Ralf Gommers wrote: >> > >> > >> > >> > On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett >> > wrote: >> >> >> >> Hi, >> >> >> >> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers >> >> wrote: >> >> > >> >> > >> >> > >> >> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett >> >> > wrote: >> >> >> >> >> >> Sorry - reposting from my subscribed address: >> >> >> >> >> >> Hi, >> >> >> >> >> >> Sorry to top-post! But - I wanted to bring the discussion back to >> >> >> licensing. I have great sympathy for the ecological and code-quality >> >> >> concerns, but licensing is a separate question, and, it seems to me, >> >> >> an urgent question. >> >> >> >> >> >> Imagine I asked some AI to give me code to replicate a particular >> >> >> algorithm A. >> >> >> >> >> >> It is perfectly possible that the AI will largely or completely >> >> >> reproduce some existing GPL code for A, from its training data. There >> >> >> is no way that I could know that the AI has done that without some >> >> >> substantial research. Surely, this is a license violation of the GPL >> >> >> code? Let's say we accept that code. Others pick up the code and >> >> >> modify it for other algorithms. The code-base gets infected with GPL >> >> >> code, in a way that will make it very difficult to disentangle. >> >> > >> >> > >> >> > This is a question that's topical for all of open source, and usages of >> >> > CoPilot & co. We're not going to come to any insightful answer here >> >> > that is specific to NumPy. There's a ton of discussion in a lot of >> >> > places; someone needs to research/summarize that to move this forward. >> >> > Debating it from scratch here is unlikely to yield new arguments imho. >> >> >> >> Right - I wasn't expecting a detailed discussion on the merits - only >> >> some thoughts on policy for now. 
>> >> >> >> > I agree with Rohit's: "it is probably hopeless to enforce a ban on AI >> >> > generated content". There are good ways to use AI code assistant tools >> >> > and bad ones; we in general cannot know whether AI tools were used at >> >> > all by a contributor (just like we can't know whether something was >> >> > copied from Stack Overflow), nor whether when it's done the content is >> >> > derived enough to fall under some other license. The best we can do >> >> > here is add a warning to the contributing docs and PR template about >> >> > this, saying the contributor needs to be the author so copied or >> >> > AI-generated content needs to not contain things that are complex >> >> > enough to be copyrightable (none of the linked PRs come close to this >> >> > threshold). >> >> >> >> Yes, these PRs are not the concern - but I believe we do need to plan >> >> now for the future. >> >> >> >> I agree it is hard to enforce, but it seems to me it would be a >> >> reasonable defensive move to say - for now - that authors will need to >> >> take full responsibility for copyright, and that, as of now, >> >> AI-generated code cannot meet that standard, so we require authors to >> >> turn off AI-generation when writing code for Numpy. >> > >> > >> > I don't think that that is any more reasonable than asking contributors to >> > not look at Stack Overflow at all, or to not look at any other code base >> > for any reason. I bet many contributors may not even know whether the >> > auto-complete functionality in their IDE comes from a regular language >> > server (see https://langserver.org/) or an AI-enhanced one. >> > >> > I think the two options are: >> > (A) do nothing yet, wait until the tools mature to the point where they >> > can actually do what you're worrying about here (at which point there may >> > be more insight/experience in the open source community about how to deal >> > with the problem. 
>> >> Have we any reason to think that the tools are not doing this now? > > > Yes, namely that tools aren't capable yet of generating the type of code that > would land in NumPy. And if it's literal code from some other project for the > few things that are standard (e.g., C/C++ code for a sorting algorithm), we'd > anyway judge if it was authored by the PR submitter or not (I've caught many > issues like that with large PRs from new contributors, e.g. translating from > Matlab code directly). > >> >>I ran one of my exercises through AI many months ago, and it found and >> reproduced the publicly available solution, including the comments, >> verbatim. > > > Not close to the same, not really a relevant data point. The question I was trying to address was - do we have any reason to think that current AI will not reproduce publicly-available code verbatim. I don't think we do, and the example was an example of AI doing just that. >> We do agree, enforcement is difficult - but I do not think AI >> autogenerated code and looking at
[Numpy-discussion] Re: Policy on AI-generated code
On Thu, Jul 4, 2024 at 8:42 PM Matthew Brett wrote: > Hi, > > On Thu, Jul 4, 2024 at 6:44 PM Ralf Gommers > wrote: > > > > > > > > On Thu, Jul 4, 2024 at 5:08 PM Matthew Brett > wrote: > >> > >> Hi, > >> > >> On Thu, Jul 4, 2024 at 3:41 PM Ralf Gommers > wrote: > >> > > >> > > >> > > >> > On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett > wrote: > >> >> > >> >> Hi, > >> >> > >> >> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers > wrote: > >> >> > > >> >> > > >> >> > > >> >> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett < > matthew.br...@gmail.com> wrote: > >> >> >> > >> >> >> Sorry - reposting from my subscribed address: > >> >> >> > >> >> >> Hi, > >> >> >> > >> >> >> Sorry to top-post! But - I wanted to bring the discussion back to > >> >> >> licensing. I have great sympathy for the ecological and > code-quality > >> >> >> concerns, but licensing is a separate question, and, it seems to > me, > >> >> >> an urgent question. > >> >> >> > >> >> >> Imagine I asked some AI to give me code to replicate a particular > algorithm A. > >> >> >> > >> >> >> It is perfectly possible that the AI will largely or completely > >> >> >> reproduce some existing GPL code for A, from its training data. > There > >> >> >> is no way that I could know that the AI has done that without some > >> >> >> substantial research. Surely, this is a license violation of the > GPL > >> >> >> code? Let's say we accept that code. Others pick up the code > and > >> >> >> modify it for other algorithms. The code-base gets infected with > GPL > >> >> >> code, in a way that will make it very difficult to disentangle. > >> >> > > >> >> > > >> >> > This is a question that's topical for all of open source, and > usages of CoPilot & co. We're not going to come to any insightful answer > here that is specific to NumPy. There's a ton of discussion in a lot of > places; someone needs to research/summarize that to move this forward. > Debating it from scratch here is unlikely to yield new arguments imho. 
> >> >>
> >> >> Right - I wasn't expecting a detailed discussion on the merits - only
> >> >> some thoughts on policy for now.
> >> >>
> >> >> > I agree with Rohit's: "it is probably hopeless to enforce a ban on
> >> >> > AI generated content". There are good ways to use AI code assistant
> >> >> > tools and bad ones; we in general cannot know whether AI tools were
> >> >> > used at all by a contributor (just like we can't know whether
> >> >> > something was copied from Stack Overflow), nor whether when it's
> >> >> > done the content is derived enough to fall under some other license.
> >> >> > The best we can do here is add a warning to the contributing docs
> >> >> > and PR template about this, saying the contributor needs to be the
> >> >> > author so copied or AI-generated content needs to not contain things
> >> >> > that are complex enough to be copyrightable (none of the linked PRs
> >> >> > come close to this threshold).
> >> >>
> >> >> Yes, these PRs are not the concern - but I believe we do need to plan
> >> >> now for the future.
> >> >>
> >> >> I agree it is hard to enforce, but it seems to me it would be a
> >> >> reasonable defensive move to say - for now - that authors will need to
> >> >> take full responsibility for copyright, and that, as of now,
> >> >> AI-generated code cannot meet that standard, so we require authors to
> >> >> turn off AI-generation when writing code for Numpy.
> >> >
> >> > I don't think that that is any more reasonable than asking
> >> > contributors to not look at Stack Overflow at all, or to not look at
> >> > any other code base for any reason. I bet many contributors may not
> >> > even know whether the auto-complete functionality in their IDE comes
> >> > from a regular language server (see https://langserver.org/) or an
> >> > AI-enhanced one.
> >> >
> >> > I think the two options are:
> >> > (A) do nothing yet, wait until the tools mature to the point where
> >> > they can actually do what you're worrying about here (at which point
> >> > there may be more insight/experience in the open source community
> >> > about how to deal with the problem).
> >>
> >> Have we any reason to think that the tools are not doing this now?
> >
> > Yes, namely that tools aren't capable yet of generating the type of
> > code that would land in NumPy. And if it's literal code from some other
> > project for the few things that are standard (e.g., C/C++ code for a
> > sorting algorithm), we'd anyway judge if it was authored by the PR
> > submitter or not (I've caught many issues like that with large PRs from
> > new contributors, e.g. translating from Matlab code directly).
> >
> >> I ran one of my exercises through AI many months ago, and it found and
> >> reproduced the publicly available solution, including the comments,
> >> verbatim.
> >
> > Not close to the same, not really a relevant data point.
>
> The question I was trying to address was - do we have any reason to
> think that current AI will not reproduce publicly-available code
> verbatim. I don't think we do, and the example was an example of AI
> doing just