Do we have a consensus on this topic or is there still further discussion to be had?
On Thu, Jul 24, 2025, at 8:26 AM, David Capwell wrote:
> Given the above, code generated in whole or in part using AI can be
> contributed if the contributor ensures that:
>
> 1. The terms and conditions of the generative AI tool do not place any
>    restrictions on use of the output that would be inconsistent with the
>    Open Source Definition.
> 2. At least one of the following conditions is met:
>    2.1. The output is not copyrightable subject matter (and would not be
>         even if produced by a human).
>    2.2. No third party materials are included in the output.
>    2.3. Any third party materials that are included in the output are
>         being used with permission (e.g., under a compatible open-source
>         license) of the third party copyright holders and in compliance
>         with the applicable license terms.
> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3
>    are met if the AI tool itself provides sufficient information about
>    output that may be similar to training data, or from code scanning
>    results.
>
> ASF Generative Tooling Guidance
> <https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results>
>
> Ariel shared this at the start. Right now we must know what tool was used
> so we can make sure its license is OK. The only tool currently flagged as
> not acceptable is OpenAI, as its terms include wording limiting what you
> may do with its output.
>
> Sent from my iPhone
>
>> On Jul 23, 2025, at 1:31 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>
>> +1 to Patrick's proposal.
>>
>> On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>> I just did some review of all the case law around copyright and AI
>>> code. So far, every claim has been dismissed. There are some other
>>> cases, like the NYTimes one, which have more merit and are proceeding.
>>>
>>> Which leads me to the opinion that this is feeling like a premature
>>> optimization. Somebody creating a PR should not have to also submit an
>>> SBOM, which is essentially what we're asking. It's undue burden and
>>> friction on the process when we should be looking for ways to reduce
>>> friction.
>>>
>>> My proposal is: no disclosures required.
>>>
>>> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>>> According to the thread, the disclosure is for legal purposes, e.g. to
>>>> establish that a patch was not produced by OpenAI's service. I think
>>>> having the discussion to clarify AI usage in the projects is
>>>> meaningful. I guess many are hesitant because of the lack of clarity
>>>> in this area.
>>>>
>>>> > I don't believe or agree with us assuming we should do this for
>>>> > every PR
>>>>
>>>> I am with you, David. Updating the mailing list for PRs is
>>>> overwhelming for both the author and the community.
>>>>
>>>> I also do not feel co-author is the best place.
>>>>
>>>> - Yifan
>>>>
>>>> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>> This is starting to get ridiculous. Disclosure statements on exactly
>>>>> how a problem was solved? What's next? Time cards?
>>>>>
>>>>> It's time to accept the world as it is. AI is in the coding toolbox
>>>>> now, just like IDEs, linters and code formatters. Some may not like
>>>>> using them, some may love using them.
>>>>> What matters is that a problem was solved and that the code matches
>>>>> whatever quality standard the project upholds, which should be
>>>>> enforced by testing and code reviews.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com> wrote:
>>>>>>> David is disclosing it in the mailing list and the GH page. Should
>>>>>>> the disclosure be persisted in the commit?
>>>>>>
>>>>>> Someone asked me to update the ML, but I don't believe or agree with
>>>>>> us assuming we should do this for every PR; personally, storing this
>>>>>> in the PR description is fine to me, as you are telling the
>>>>>> reviewers (who you need to communicate this to).
>>>>>>
>>>>>>> I'd say we can use the co-authored part of our commit messages to
>>>>>>> disclose the actual AI that was used?
>>>>>>
>>>>>> Heh... I kinda feel dirty doing that… No one does that when they
>>>>>> take something from a blog or Stack Overflow, but when you do that
>>>>>> you should still attribute by linking… which I guess is what
>>>>>> Co-Authored does?
>>>>>>
>>>>>> I don't know… feels dirty...
>>>>>>
>>>>>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella
>>>>>>> <conta...@bernardobotella.com> wrote:
>>>>>>>
>>>>>>> That's a great point. I'd say we can use the co-authored part of
>>>>>>> our commit messages to disclose the actual AI that was used?
>>>>>>>
>>>>>>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Curious, what are the good ways to disclose the information?
>>>>>>>>
>>>>>>>> > All of which comes back to: if people disclose if they used AI,
>>>>>>>> > what models, and whether they used the code or text the model
>>>>>>>> > wrote verbatim or used it as a scaffolding and then heavily
>>>>>>>> > modified everything, I think we'll be in a pretty good spot.
>>>>>>>>
>>>>>>>> David is disclosing it in the mailing list and the GH page. Should
>>>>>>>> the disclosure be persisted in the commit?
>>>>>>>>
>>>>>>>> - Yifan
>>>>>>>>
>>>>>>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com> wrote:
>>>>>>>>> Sent out this patch that was written 100% by Claude:
>>>>>>>>> https://github.com/apache/cassandra/pull/4266
>>>>>>>>>
>>>>>>>>> Claude's license doesn't have issues with the current ASF policy
>>>>>>>>> as far as I can tell. If you look at the patch it's very clear
>>>>>>>>> there isn't any copyrighted material (it's gluing together C*
>>>>>>>>> classes).
>>>>>>>>>
>>>>>>>>> I could have written this myself, but I had to focus on code
>>>>>>>>> reviews and also needed this patch out, so I asked Claude to
>>>>>>>>> write it for me so I could focus on reviews. I have reviewed it
>>>>>>>>> myself and it's basically the same code I would have written
>>>>>>>>> (notice how small and focused the patch is; larger stuff doesn't
>>>>>>>>> normally pass my peer review).
>>>>>>>>>
>>>>>>>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 to what Josh said
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Did some more digging. Apparently the way a lot of
>>>>>>>>>>> headline-grabbers have been making models reproduce code
>>>>>>>>>>> verbatim is to prompt them with dozens of verbatim tokens of
>>>>>>>>>>> copyrighted code as input, where completion is then very
>>>>>>>>>>> heavily weighted to regurgitate the initial implementation.
>>>>>>>>>>> Which makes sense; if you copy/paste 100 lines of copyrighted
>>>>>>>>>>> code, the statistically likely completion for that will be the
>>>>>>>>>>> initial implementation.
>>>>>>>>>>>
>>>>>>>>>>> For local LLMs, verbatim reproduction is unlikely for a
>>>>>>>>>>> *different* but apparently comparable reason: they have far
>>>>>>>>>>> fewer parameters (32B vs. 671B for DeepSeek, for instance)
>>>>>>>>>>> relative to a pre-training corpus of trillions of tokens (30T
>>>>>>>>>>> in the case of Qwen3-32B, for instance), so the individual
>>>>>>>>>>> tokens from the copyrighted material are highly unlikely to
>>>>>>>>>>> actually be *stored* in the model to be reproduced, and
>>>>>>>>>>> certainly not in sequence. They don't have the post-generation
>>>>>>>>>>> checks claimed by the SOTA models, but are apparently
>>>>>>>>>>> considered in the "< 1 in 10,000 completions will generate
>>>>>>>>>>> copyrighted code" territory.
>>>>>>>>>>>
>>>>>>>>>>> When given a human-language prompt, or a multi-agent pipelined
>>>>>>>>>>> "still human language but from your architect agent" prompt,
>>>>>>>>>>> the likelihood of producing a string of copyrighted code in
>>>>>>>>>>> that manner is statistically very, very low. I think we're at
>>>>>>>>>>> far more risk of contributors copy/pasting Stack Overflow or
>>>>>>>>>>> code from other projects than we are of modern genAI models
>>>>>>>>>>> producing blocks of copyrighted code.
>>>>>>>>>>>
>>>>>>>>>>> All of which comes back to: if people disclose whether they
>>>>>>>>>>> used AI, what models, and whether they used the code or text
>>>>>>>>>>> the model wrote verbatim or used it as scaffolding and then
>>>>>>>>>>> heavily modified everything, I think we'll be in a pretty good
>>>>>>>>>>> spot.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>>>>>>>>> 2. Models that do not do output filtering to restrict the
>>>>>>>>>>>>> reproduction of training data unless the tool can ensure the
>>>>>>>>>>>>> output is license compatible?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2 would basically prohibit locally run models.
>>>>>>>>>>>>
>>>>>>>>>>>> I am not for this, for the reasons listed above. There isn't
>>>>>>>>>>>> a difference between this and a contributor copying code and
>>>>>>>>>>>> sending it our way. We still need to validate that the code
>>>>>>>>>>>> can be accepted.
>>>>>>>>>>>>
>>>>>>>>>>>> We also have the issue of this being a broad stroke. If the
>>>>>>>>>>>> user asked a model to write a test for code the human wrote,
>>>>>>>>>>>> do we reject the contribution because they used a local
>>>>>>>>>>>> model? This poses very little copyright risk, yet our policy
>>>>>>>>>>>> would now reject it.
>>>>>>>>>>>>
>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Models that do not do output filtering to restrict the
>>>>>>>>>>>>> reproduction of training data unless the tool can ensure the
>>>>>>>>>>>>> output is license compatible?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2 would basically prohibit locally run models.
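
For anyone weighing the "co-authored" disclosure idea discussed in this
thread: git and GitHub already recognize a Co-authored-by trailer at the
end of a commit message, which could double as the AI disclosure without
any mailing list traffic. A minimal sketch, assuming the project were to
adopt this; the trailer syntax is the standard git/GitHub convention, but
the subject line is invented for illustration and the exact name/email
string to use for an AI tool would be a project decision (the one below is
what some Claude tooling emits, not an ASF standard):

    Add retry handling for transient stream failures

    Initial implementation generated with Claude; reviewed and modified
    by the committer.

    Co-authored-by: Claude <noreply@anthropic.com>

GitHub parses the trailer and lists the co-author on the commit, so AI
involvement would be recorded in history and remain searchable later.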