I'm on board with the allow list (1) or option 2. 3 just isn't realistic anymore.
Patrick

On Mon, Jun 16, 2025 at 3:09 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:

> I haven't participated much here, but my vote would be basically #1, i.e. an "allow list" with a clear procedure for expansion.
>
> On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>> Hi,
>>
>> We could, but if the allow list is binding then it's still an allow list with some guidance on how to expand the allow list.
>>
>> If it isn't binding then it's guidance, so still option 2 really.
>>
>> I think the key distinction to find some early consensus on is whether we do a binding allow list or just guidance; then we can iron out the guidance, which I think will be less controversial to work out.
>>
>> Or option 3, which is not accepting AI generated contributions. I think there are some with healthy skepticism of AI generated code, but so far I haven't met anyone who wants to forbid it entirely.
>>
>> Ariel
>>
>> On Mon, Jun 16, 2025, at 4:54 PM, Josh McKenzie wrote:
>>
>> Couldn't our official stance be a combination of 1 and 2? i.e. "Here's an allow list. If you're using something not on that allow list, here's some basic guidance and maybe let us know how you tried to mitigate some of this risk so we can update our allow list w/some nuance."
>>
>> On Mon, Jun 16, 2025, at 4:39 PM, Ariel Weisberg wrote:
>>
>> Hi,
>>
>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>
>> Where are you getting this from? From the OpenAI terms of use: https://openai.com/policies/terms-of-use/
>>
>> Direct from the ASF legal mailing list discussion I linked to in my original email calling this out: https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 It's not clear from that thread precisely what they are objecting to and whether it has changed (another challenge!), but I believe it's restrictions on what you are allowed to do with the output of OpenAI models. And if you get the output via other services it's under a different license and it's fine!
>>
>> Already we are demonstrating that it is not trivial to understand what is and isn't allowed.
>>
>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>
>> I still maintain that trying to publish an exhaustive list of acceptable tools does not seem reasonable. But I agree that giving people guidance is possible. Maybe having a statement in the contribution guidelines along the lines of:
>>
>> The list doesn't need to be exhaustive. We are not required to accept AI generated code at all!
>>
>> We can make a best effort to vet just the ones that people actually want to widely use, refuse everything else, and be better off than allowing people to use tools that are known not to be license compatible or that make little/no effort to avoid reproducing large amounts of copyrighted code.
>>
>> "Make sure your tools do X, here are some that at the time of being added to this list did X: Tool A, Tool B …
>> Here is a list of tools that at the time of being added to this list did not satisfy X. Tool Z - reason why"
>>
>> I would be fine with this as an outcome. If we voted with multiple options it wouldn't be my first choice.
>>
>> This thread only has 4 participants so far, so it's hard to get a signal on what people would want if we tried to vote.
>>
>> David, Scott, anyone else, if the options were:
>>
>> 1. Allow list
>> 2. Basic guidance as suggested by Jeremiah, but primarily leave it up to contributor/reviewer
>> 3. Do nothing
>> 4. My choice isn't here
>>
>> What would you want?
>>
>> My vote in choice order is 1, 2, 3.
>>
>> Ariel
>>
>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>
>> I respectfully mean that contributors, reviewers, and committers can't feasibly understand and enforce the ASF guidelines.
>>
>> If this is true, then the ASF is in a lot of trouble and you should bring it up with the ASF board. Where are you getting this from? From the OpenAI terms of use: https://openai.com/policies/terms-of-use/
>>
>> We don't even necessarily need to be that restrictive beyond requiring tools that make at least some effort not to reproduce large amounts of copyrighted code that may/may not be license compatible, or tools that are themselves not license compatible. This ends up encompassing most of the ones people want to use anyways.
>>
>> There is a non-zero amount we can do to educate and guide that would be better than pointing people to the ASF guidelines and leaving it at that.
>>
>> I still maintain that trying to publish an exhaustive list of acceptable tools does not seem reasonable. But I agree that giving people guidance is possible. Maybe having a statement in the contribution guidelines along the lines of:
>> "Make sure your tools do X, here are some that at the time of being added to this list did X: Tool A, Tool B …
>> Here is a list of tools that at the time of being added to this list did not satisfy X. Tool Z - reason why"
>>
>> -Jeremiah
>>
>> On Jun 11, 2025 at 11:48:30 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi,
>>
>> I am not saying you said it, but I respectfully mean that contributors, reviewers, and committers can't feasibly understand and enforce the ASF guidelines. We would be another link in a chain of people abdicating responsibility, starting with LLM vendors serving up models that reproduce copyrighted code, then going to ASF legal which gives us guidelines without the tools to enforce those guidelines, and now we (the PMC) would be doing the same to contributors, reviewers, and committers.
>>
>> I don't think anyone is going to be able to maintain and enforce a list of acceptable tools for contributors to the project to stick to. We can't know what someone did on their laptop, all we can do is evaluate the code they submit.
>>
>> I agree we might not be able to do a perfect job at any aspect of trying to make sure that the code we accept is not problematic in some way, but that doesn't mean we shouldn't try?
>>
>> We don't even necessarily need to be that restrictive beyond requiring tools that make at least some effort not to reproduce large amounts of copyrighted code that may/may not be license compatible, or tools that are themselves not license compatible. This ends up encompassing most of the ones people want to use anyways.
>>
>> How many people are aware that if you get code from OpenAI directly the license isn't ASL compatible, but that if you get it via Microsoft services that use OpenAI models it's ASL compatible? It's not in the ASF guidelines (it was, but they removed it!).
>>
>> How many people are aware that when people use locally run models there is no output filtering, further increasing the odds of the model reproducing copyright encumbered code?
>>
>> There is a non-zero amount we can do to educate and guide that would be better than pointing people to the ASF guidelines and leaving it at that.
>> The ASF guidelines themselves have suggestions like requiring people to say if they used AI, and then which AI. I don't think it's very useful beyond checking license compatibility of the AI itself, but that is something we should be doing, so it might as well be documented and included in the PR text.
>>
>> Ariel
>>
>> On Mon, Jun 2, 2025, at 7:54 PM, Jeremiah Jordan wrote:
>>
>> I don't think I said we should abdicate responsibility? I said the key point is that contributors, and more importantly reviewers and committers, understand the ASF guidelines and hold all code to those standards. Any suspect code should be blocked during review. As Roman says in your quote, this isn't about AI, it's about copyright. If someone submits copyrighted code to the project, whether an AI generated it or they just grabbed it from a Google search, it's on the project to try not to accept it.
>>
>> I don't think anyone is going to be able to maintain and enforce a list of acceptable tools for contributors to the project to stick to. We can't know what someone did on their laptop, all we can do is evaluate the code they submit.
>>
>> -Jeremiah
>>
>> On Mon, Jun 2, 2025 at 6:39 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi,
>>
>> As PMC members/committers we aren't supposed to abdicate this to legal or to contributors. Despite the fact that we aren't equipped to solve this problem, we are supposed to be making sure that code contributed is non-infringing.
>>
>> This is a quotation from Roman Shaposhnik from this legal thread: https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd
>>
>> Yes, because you have to. Again -- forget about AI -- if a drive-by contributor submits a patch that has huge amounts of code stolen from some existing copyright holder -- it is very much ON YOU as a committer/PMC to prevent that from happening.
>>
>> We aren't supposed to knowingly allow people to use AI tools that are known to generate infringing contributions or contributions which are not license compatible (such as OpenAI terms of use).
>>
>> Ariel
>>
>> On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote:
>>
>> > Ultimately it's the contributor's (and committer's) job to ensure that their contributions meet the bar for acceptance
>>
>> To me this is the key point. Given how pervasive this stuff is becoming, I don't think it's feasible to make some list of tools and enforce it. Even without getting into extra tools, IDEs (including IntelliJ) are doing more and more LLM based code suggestion as time goes on. I think we should point people to the ASF Guidelines around such tools, and the guidelines around copyrighted code, and then continue to review patches with the high standards we have always had in this project.
>>
>> -Jeremiah
>>
>> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi,
>>
>> To clarify, are you saying that we should not accept AI generated code until it has been looked at by a human and then written again with different "wording" to ensure that it doesn't directly copy anything?
>>
>> Or do you mean something else about the quality of "vibe coding" and how we shouldn't allow it because it makes bad code? Ultimately it's the contributor's (and committer's) job to ensure that their contributions meet the bar for acceptance, and I don't think we should tell them how to go about meeting that bar beyond what is needed to address the copyright concern.
>> I agree that the bar set by the Apache guidelines is pretty high. They are simultaneously impossible and trivial to meet depending on how you interpret them, and we are not very well equipped to interpret them.
>>
>> It would have been more straightforward for them to simply say no, but they didn't opt to do that, as if there is some way for PMCs to acceptably take AI generated contributions.
>>
>> Ariel
>>
>> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>>
>> fine tuning encourage not reproducing things verbatim
>>
>> I think not producing copyrighted output from your training data is a technically feasible achievement for these vendors so I have a moderate level of trust they will succeed at it if they say they do it.
>>
>> Some team members and I discussed this in the context of my documentation patch (which utilized Claude during composition). I conducted an experiment to pose high-level Cassandra-related questions to a model without additional context, while adjusting the temperature parameter (tested at 0.2, 0.5, and 0.8). The results revealed that each test generated content copied verbatim from a specific non-Apache (and non-DSE) website. I did not verify whether this content was copyrighted, though it was easily identifiable through a simple Google search. This occurred as a single sentence within the generated document, and as I am not a legal expert, I cannot determine whether this constitutes a significant issue.
>>
>> The complexity increases when considering models trained on different languages, which may translate content into English. In such cases, a Google search would fail to detect the origin. Is this still considered plagiarism? Does it violate copyright laws? I am uncertain.
>>
>> Similar challenges arise with code generation. For instance, if a model is trained on a GPL-licensed Python library that implements a novel data structure, and the model subsequently rewrites this structure in Java, a Google search is unlikely to identify the source.
>>
>> Personally, I do not assume these models will avoid producing copyrighted material. This doesn't mean I am against AI at all, but rather reflects my belief that the requirements set by Apache are not easily "provable" in such scenarios.
>>
>> My personal opinion is that we should at least consider allow listing a few specific sources (any vendor that scans output for infringement) and add that to the PR template and in other locations (readme, web site). Bonus points if we can set up code scanning (useful for non-AI contributions!).
>>
>> My perspective, after trying to see what AI can do, is the following:
>>
>> Strengths
>> * Generating a preliminary draft of a document and assisting with iterative revisions
>> * Documenting individual methods
>> * Generation of "simple" methods and scripts, provided the underlying libraries are well-documented in public repositories
>> * Managing repetitive or procedural tasks, such as "migrating from X to Y" or "converting serializations to the X interface"
>>
>> Limitations
>> * Producing a fully functional document in a single attempt that meets merge standards. When documenting Gens.java and Property.java, the output appeared plausible but contained frequent inaccuracies.
>> * Addressing complex or ambiguous scenarios ("gossip"), though this challenge is not unique to AI; Matt Byrd and I tested Claude for CASSANDRA-20659, where it could identify relevant code but proposed solutions that risked corrupting production clusters.
>> * Interpreting large-scale codebases. Beyond approximately 300 lines of actual code (excluding formatting), performance degrades significantly, leading to a marked decline in output quality.
>>
>> Note: When referring to AI/LLMs, I am not discussing interactions with a user interface to execute specific tasks, but rather leveraging code agents like Roo and Aider to provide contextual information to the LLM.
>>
>> Given these observations, it remains challenging to determine optimal practices. In some contexts it is very clear that nothing was taken from external work (e.g., "create a test using our BTree class that inserts a row with a null column," "analyze this function's purpose"). However, for substantial tasks, the situation becomes more complex. If the author employed AI as a collaborative tool during "pair programming," the concerns are not really that different from Google searches (unless the work involves unique elements like introducing new data structures or indexes). Conversely, if the author "vibe coded" the entire patch, two primary concerns arise: whether the author has rights to the code, and whether its quality aligns with requirements.
>>
>> TL;DR - I am not against AI contributions, but strongly prefer it's done as "pair programming". My experience with "vibe coding" makes me worry about the quality of the code, and that the author is less likely to validate that the code generated is safe to donate.
>>
>> This email was generated with the help of AI =)
>>
>> On May 30, 2025, at 3:00 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi all,
>>
>> It looks like we haven't discussed this much and haven't settled on a policy for what kinds of AI generated contributions we accept and what vetting is required for them.
>>
>> https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results

```
Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:

1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
2. At least one of the following conditions is met:
   2.1 The output is not copyrightable subject matter (and would not be even if produced by a human).
   2.2 No third party materials are included in the output.
   2.3 Any third party materials that are included in the output are being used with permission (e.g., under a compatible open-source license) of the third party copyright holders and in compliance with the applicable license terms.
3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about output that may be similar to training data, or from code scanning results.
```

>> There is a lot to unpack there, but it seems like any one of the conditions under 2 needs to be met, and 3 describes how 2.2 and 2.3 can be satisfied.
>>
>> 2.1 is tricky as we are not copyright lawyers, and 2.2 and 2.3 are a pretty high bar in that it's hard to know whether you have met them.
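>> Condition 3 points at code scanning. As a rough illustration only (a sketch, not a real scanner; the 25-token threshold and the local corpus directory are arbitrary assumptions), the most naive form of such a check just looks for long verbatim overlaps between a contribution and a corpus of known third party code:

```python
# Naive verbatim-overlap check: illustrative sketch only, not a real code
# scanning tool. Assumes a local directory of known third party sources to
# compare against; real scanners also handle tokenization, renaming,
# license detection, and so on.
import re
import sys
from pathlib import Path

SHINGLE_WORDS = 25  # flag runs of 25+ identical whitespace-separated tokens


def shingles(text: str, n: int = SHINGLE_WORDS) -> set[str]:
    # Normalize whitespace so formatting differences don't hide verbatim copies.
    tokens = re.findall(r"\S+", text)
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def suspicious_overlaps(patch_text: str, corpus_dir: str) -> list[tuple[str, str]]:
    # Return (corpus file, example overlapping run) pairs for any shared shingle.
    patch_shingles = shingles(patch_text)
    hits = []
    for path in Path(corpus_dir).rglob("*.java"):
        overlap = patch_shingles & shingles(path.read_text(errors="ignore"))
        if overlap:
            hits.append((str(path), next(iter(overlap))))
    return hits


if __name__ == "__main__":
    # Usage: python overlap_check.py <patch file> <corpus directory>
    patch = Path(sys.argv[1]).read_text(errors="ignore")
    for source, example in suspicious_overlaps(patch, sys.argv[2]):
        print(f"possible verbatim overlap with {source}: {example[:80]}...")
```

>> Anything a vendor or a dedicated scanning service does is far more sophisticated than this, but even a check this crude would catch the "large amounts of copyrighted code reproduced verbatim" case.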
>> Do we have anyone in the community running any code scanning tools already?
>>
>> Here is the JIRA for the addition of the generative AI policy: https://issues.apache.org/jira/browse/LEGAL-631
>> Legal mailing list discussion of the policy: https://lists.apache.org/thread/vw3jf4726yrhovg39mcz1y89mx8j4t8s
>> Legal mailing list discussion of compliant tools: https://lists.apache.org/thread/nzyl311q53xhpq99grf6l1h076lgzybr
>> Legal mailing list discussion about how OpenAI terms are not Apache compatible: https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16
>> Hadoop mailing list message hinting that they accept contributions but ask which tool: https://lists.apache.org/thread/bgs8x1f9ovrjmhg6b450bz8bt7o43yxj
>> Spark mailing list message where they have given up on stopping people: https://lists.apache.org/thread/h6621sxfxcnnpsoyr31x65z207kk80fr
>>
>> I didn't see other projects discussing and deciding how to handle these contributions, but I also only checked a few of them: Hadoop, Spark, Druid, and Pulsar. I also can't see their PMC mailing lists.
>>
>> I asked O3 to deep research what is done to avoid producing copyrighted code: https://chatgpt.com/share/683a2983-dd9c-8009-9a66-425012af840d
>>
>> To summarize: training data is deduplicated so the model is less likely to reproduce it verbatim, prompts and fine tuning encourage not reproducing things verbatim, the inference is biased to not pick the best option but some neighboring one, encouraging originality, and in some instances the output is checked to make sure it doesn't match the training data. So to some extent 2.2 is being done, to different degrees depending on what product you are using.
>>
>> It's worth noting that scanning the output can be probabilistic, in the case of say Anthropic, and they still recommend code scanning.
>>
>> Quite notably, Anthropic indemnifies its enterprise users against copyright claims. It's not perfect, but it does mean they have an incentive to make sure there are fewer copyright claims. We could choose to be picky and only accept specific sources of LLM generated code based on perceived safety.
>>
>> I think not producing copyrighted output from your training data is a technically feasible achievement for these vendors, so I have a moderate level of trust they will succeed at it if they say they do it.
>>
>> I could send a message to the legal list asking for clarification and a set of tools, but based on Roman's communication (https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd) I think this is kind of what we get. It's on us to ensure the contributions are kosher, either by code scanning or by accepting that the LLM vendors are doing a good job at avoiding copyrighted output.
>>
>> My personal opinion is that we should at least consider allow listing a few specific sources (any vendor that scans output for infringement) and add that to the PR template and in other locations (readme, web site). Bonus points if we can set up code scanning (useful for non-AI contributions!).
>>
>> Regards,
>> Ariel
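>> P.S. If we do add this to the PR template, the disclosure section could be as small as something like the following (the wording is only a sketch, nothing here is settled):

```
### Generative AI tooling

- [ ] No generative AI tools were used to produce this contribution, OR
- [ ] The following tools were used: <tool(s) and version(s)>
- [ ] The tool(s) used are on the project's allow list (or an addition has been proposed on the dev list)
- [ ] I have reviewed the generated portions and am not aware of any third party code reproduced without a compatible license
```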