Do we have a consensus on this topic or is there still further discussion to be had?
On Thu, Jul 24, 2025, at 8:26 AM, David Capwell wrote:
> Given the above, code generated in whole or in part using AI can be
> contributed if the contributor ensures that:
>
> 1. The terms and conditions of the generative AI tool do not place any
>    restrictions on use of the output that would be inconsistent with the
>    Open Source Definition.
> 2. At least one of the following conditions is met:
>    2.1. The output is not copyrightable subject matter (and would not be
>         even if produced by a human).
>    2.2. No third party materials are included in the output.
>    2.3. Any third party materials that are included in the output are
>         being used with permission (e.g., under a compatible open-source
>         license) of the third party copyright holders and in compliance
>         with the applicable license terms.
> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3
>    are met if the AI tool itself provides sufficient information about
>    output that may be similar to training data, or from code scanning
>    results.
>
> ASF Generative Tooling Guidance
> <https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results>
>
> Ariel shared this at the start. Right now we must know what tool was used
> so we can make sure its license is OK. The only tool currently flagged as
> not acceptable is OpenAI, as its terms include wording limiting what you
> may do with its output.
>
> Sent from my iPhone
>
>> On Jul 23, 2025, at 1:31 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>
>> +1 to Patrick's proposal.
>>
>> On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>> I just did some review of all the case law around copyright and AI
>>> code. So far, every claim has been dismissed. There are some other
>>> cases, like the NYTimes one, which have more merit and are proceeding.
>>>
>>> Which leads me to the opinion that this is feeling like a premature
>>> optimization. Somebody creating a PR should not have to also submit an
>>> SBOM, which is essentially what we're asking. It's undue burden and
>>> friction on the process when we should be looking for ways to reduce
>>> friction.
>>>
>>> My proposal is: no disclosures required.
>>>
>>> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>>> According to the thread, the disclosure is for legal purposes, e.g. to
>>>> establish that a patch was not produced by OpenAI's service. I think
>>>> having the discussion to clarify AI usage in the projects is
>>>> meaningful. I guess many are hesitant because of the lack of clarity
>>>> in this area.
>>>>
>>>> > I don't believe or agree with us assuming we should do this for
>>>> > every PR
>>>>
>>>> I am with you, David. Updating the mailing list for PRs is
>>>> overwhelming for both the author and the community.
>>>>
>>>> I also do not feel co-author is the best place.
>>>>
>>>> - Yifan
>>>>
>>>> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>> This is starting to get ridiculous. Disclosure statements on exactly
>>>>> how a problem was solved? What's next? Time cards?
>>>>>
>>>>> It's time to accept the world as it is. AI is in the coding toolbox
>>>>> now, just like IDEs, linters and code formatters. Some may not like
>>>>> using them, some may love using them.
>>>>> What matters is that a problem was solved and that the code matches
>>>>> whatever quality standard the project upholds, which should be
>>>>> enforced by testing and code reviews.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com> wrote:
>>>>>>> David is disclosing it in the mailing list and the GH page. Should
>>>>>>> the disclosure be persisted in the commit?
>>>>>>
>>>>>> Someone asked me to update the ML, but I don't believe or agree with
>>>>>> us assuming we should do this for every PR; personally, storing this
>>>>>> in the PR description is fine to me, as you are telling the
>>>>>> reviewers (who you need to communicate this to).
>>>>>>
>>>>>>> I'd say we can use the co-authored part of our commit messages to
>>>>>>> disclose the actual AI that was used?
>>>>>>
>>>>>> Heh... I kinda feel dirty doing that… No one does that when they
>>>>>> take something from a blog or Stack Overflow, but when you do that
>>>>>> you should still attribute by linking… which I guess is what
>>>>>> Co-Authored does?
>>>>>>
>>>>>> I don't know… feels dirty...
>>>>>>
>>>>>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella
>>>>>>> <conta...@bernardobotella.com> wrote:
>>>>>>>
>>>>>>> That's a great point. I'd say we can use the co-authored part of
>>>>>>> our commit messages to disclose the actual AI that was used?
>>>>>>>
>>>>>>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Curious, what are the good ways to disclose the information?
>>>>>>>>
>>>>>>>> > All of which comes back to: if people disclose if they used AI,
>>>>>>>> > what models, and whether they used the code or text the model
>>>>>>>> > wrote verbatim or used it as a scaffolding and then heavily
>>>>>>>> > modified everything, I think we'll be in a pretty good spot.
>>>>>>>>
>>>>>>>> David is disclosing it in the mailing list and the GH page. Should
>>>>>>>> the disclosure be persisted in the commit?
>>>>>>>>
>>>>>>>> - Yifan
>>>>>>>>
>>>>>>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com> wrote:
>>>>>>>>> Sent out this patch that was written 100% by Claude:
>>>>>>>>> https://github.com/apache/cassandra/pull/4266
>>>>>>>>>
>>>>>>>>> Claude's license doesn't have issues with the current ASF policy
>>>>>>>>> as far as I can tell. If you look at the patch it's very clear
>>>>>>>>> there isn't any copyrighted material (it's gluing together C*
>>>>>>>>> classes).
>>>>>>>>>
>>>>>>>>> I could have written this myself, but I had to focus on code
>>>>>>>>> reviews and also needed this patch out, so I asked Claude to
>>>>>>>>> write it for me so I could focus on reviews. I have reviewed it
>>>>>>>>> myself and it's basically the same code I would have written
>>>>>>>>> (notice how small and focused the patch is; larger stuff doesn't
>>>>>>>>> normally pass my peer review).
>>>>>>>>>
>>>>>>>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 to what Josh said
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Did some more digging. Apparently the way a lot of
>>>>>>>>>>> headline-grabbers have been making models reproduce code
>>>>>>>>>>> verbatim is to prompt them with dozens of verbatim tokens of
>>>>>>>>>>> copyrighted code as input, where completion is then very
>>>>>>>>>>> heavily weighted to regurgitate the initial implementation.
>>>>>>>>>>> Which makes sense; if you copy/paste 100 lines of copyrighted
>>>>>>>>>>> code, the statistically likely completion for that will be the
>>>>>>>>>>> initial implementation.
>>>>>>>>>>>
>>>>>>>>>>> For local LLMs, verbatim reproduction is unlikely for a
>>>>>>>>>>> *different* but apparently comparable reason: they have far
>>>>>>>>>>> fewer parameters (32B vs. 671B for DeepSeek, for instance)
>>>>>>>>>>> relative to a pre-training corpus of trillions of tokens (30T
>>>>>>>>>>> in the case of Qwen3-32B, for instance), so the individual
>>>>>>>>>>> tokens from the copyrighted material are highly unlikely to
>>>>>>>>>>> actually be *stored* in the model to be reproduced, and
>>>>>>>>>>> certainly not in sequence. They don't have the post-generation
>>>>>>>>>>> checks claimed by the SOTA models, but are apparently
>>>>>>>>>>> considered in the "< 1 in 10,000 completions will generate
>>>>>>>>>>> copyrighted code" territory.
>>>>>>>>>>>
>>>>>>>>>>> When given a human-language prompt, or a multi-agent pipelined
>>>>>>>>>>> "still human language but from your architect agent" prompt,
>>>>>>>>>>> the likelihood of producing a string of copyrighted code in
>>>>>>>>>>> that manner is statistically very, very low. I think we're at
>>>>>>>>>>> far more risk of contributors copy/pasting Stack Overflow or
>>>>>>>>>>> code from other projects than we are of modern genAI models
>>>>>>>>>>> producing blocks of copyrighted code.
>>>>>>>>>>>
>>>>>>>>>>> All of which comes back to: if people disclose whether they
>>>>>>>>>>> used AI, what models, and whether they used the code or text
>>>>>>>>>>> the model wrote verbatim or used it as scaffolding and then
>>>>>>>>>>> heavily modified everything, I think we'll be in a pretty good
>>>>>>>>>>> spot.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>>>>>>>>> 2. Models that do not do output filtering to restrict the
>>>>>>>>>>>>> reproduction of training data unless the tool can ensure the
>>>>>>>>>>>>> output is license compatible?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2 would basically prohibit locally run models.
>>>>>>>>>>>>
>>>>>>>>>>>> I am not for this, for the reasons listed above. There isn't
>>>>>>>>>>>> a difference between this and a contributor copying code and
>>>>>>>>>>>> sending it our way. We still need to validate that the code
>>>>>>>>>>>> can be accepted.
>>>>>>>>>>>>
>>>>>>>>>>>> We also have the issue of this being a broad stroke. If the
>>>>>>>>>>>> user asked a model to write a test for code the human wrote,
>>>>>>>>>>>> do we reject the contribution because they used a local
>>>>>>>>>>>> model? This poses very little copyright risk, yet our policy
>>>>>>>>>>>> would now reject it.
>>>>>>>>>>>>
>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Models that do not do output filtering to restrict the
>>>>>>>>>>>>> reproduction of training data unless the tool can ensure the
>>>>>>>>>>>>> output is license compatible?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2 would basically prohibit locally run models.
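
For anyone weighing the "co-authored" disclosure idea discussed in this
thread: git and GitHub already recognize a Co-authored-by trailer at the
end of a commit message, which could double as the AI disclosure without
any mailing list traffic. A minimal sketch, assuming the project were to
adopt this; the trailer syntax is the standard git/GitHub convention, but
the subject line is invented for illustration and the exact name/email
string to use for an AI tool would be a project decision (the one below is
what some Claude tooling emits, not an ASF standard):

    Add retry handling for transient stream failures

    Initial implementation generated with Claude; reviewed and modified
    by the committer.

    Co-authored-by: Claude <noreply@anthropic.com>

GitHub parses the trailer and lists the co-author on the commit, so AI
involvement would be recorded in history and remain searchable later.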