I'm on board with the allow list (1) or option 2. 3 just isn't realistic anymore.
Patrick

On Mon, Jun 16, 2025 at 3:09 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:

> I haven't participated much here, but my vote would be basically #1, i.e. an "allow list" with a clear procedure for expansion.
>
> On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>> Hi,
>>
>> We could, but if the allow list is binding then it's still an allow list with some guidance on how to expand the allow list.
>>
>> If it isn't binding then it's guidance, so still option 2 really.
>>
>> I think the key distinction to find some early consensus on is whether we do a binding allow list or just guidance; then we can iron out the guidance, which I think will be less controversial to work out.
>>
>> Or option 3, which is not accepting AI generated contributions. I think there are some with healthy skepticism of AI generated code, but so far I haven't met anyone who wants to forbid it entirely.
>>
>> Ariel
>>
>> On Mon, Jun 16, 2025, at 4:54 PM, Josh McKenzie wrote:
>>
>> Couldn't our official stance be a combination of 1 and 2? i.e. "Here's an allow list. If you're using something not on that allow list, here's some basic guidance and maybe let us know how you tried to mitigate some of this risk so we can update our allow list w/some nuance."
>>
>> On Mon, Jun 16, 2025, at 4:39 PM, Ariel Weisberg wrote:
>>
>> Hi,
>>
>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>
>> Where are you getting this from? From the OpenAI terms of use: https://openai.com/policies/terms-of-use/
>>
>> Direct from the ASF legal mailing list discussion I linked to in my original email calling this out: https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 It's not clear from that thread precisely what they are objecting to and whether it has changed (another challenge!), but I believe it's restrictions on what you are allowed to do with the output of OpenAI models. And if you get the output via other services it's under a different license and it's fine!
>>
>> Already we are demonstrating that it is not trivial to understand what is and isn't allowed.
>>
>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>
>> I still maintain that trying to publish an exhaustive list of acceptable tools does not seem reasonable. But I agree that giving people guidance is possible. Maybe having a statement in the contribution guidelines along the lines of:
>>
>> The list doesn't need to be exhaustive. We are not required to accept AI generated code at all!
>>
>> We can make a best effort to vet just the ones that people actually want to widely use, refuse everything else, and be better off than allowing people to use tools that are known not to be license compatible or that make little/no effort to avoid reproducing large amounts of copyrighted code.
>>
>> "Make sure your tools do X, here are some that at the time of being added to this list did X: Tool A, Tool B …
>> Here is a list of tools that at the time of being added to this list did not satisfy X. Tool Z - reason why"
>>
>> I would be fine with this as an outcome. If we voted with multiple options it wouldn't be my first choice.
>>
>> This thread only has 4 participants so far, so it's hard to get a signal on what people would want if we tried to vote.
>>
>> David, Scott, anyone else, if the options were:
>>
>> 1. Allow list
>> 2. Basic guidance as suggested by Jeremiah, but primarily leave it up to contributor/reviewer
>> 3. Do nothing
>> 4. My choice isn't here
>>
>> What would you want?
>>
>> My vote in choice order is 1, 2, 3.
>>
>> Ariel
>>
>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>
>> I respectfully mean that contributors, reviewers, and committers can't feasibly understand and enforce the ASF guidelines.
>>
>> If this is true, then the ASF is in a lot of trouble and you should bring it up with the ASF board. Where are you getting this from? From the OpenAI terms of use: https://openai.com/policies/terms-of-use/
>>
>> We don't even necessarily need to be that restrictive beyond requiring tools that make at least some effort not to reproduce large amounts of copyrighted code that may/may not be license compatible, or tools that are themselves not license compatible. This ends up encompassing most of the ones people want to use anyways.
>>
>> There is a non-zero amount we can do to educate and guide that would be better than pointing people to the ASF guidelines and leaving it at that.
>>
>> I still maintain that trying to publish an exhaustive list of acceptable tools does not seem reasonable. But I agree that giving people guidance is possible. Maybe having a statement in the contribution guidelines along the lines of:
>> "Make sure your tools do X, here are some that at the time of being added to this list did X: Tool A, Tool B …
>> Here is a list of tools that at the time of being added to this list did not satisfy X. Tool Z - reason why"
>>
>> -Jeremiah
>>
>> On Jun 11, 2025 at 11:48:30 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi,
>>
>> I am not saying you said it, but I respectfully mean that contributors, reviewers, and committers can't feasibly understand and enforce the ASF guidelines. We would be another link in a chain of people abdicating responsibility, starting with LLM vendors serving up models that reproduce copyrighted code, then going to ASF legal which gives us guidelines without the tools to enforce those guidelines, and now we (the PMC) would be doing the same to contributors, reviewers, and committers.
>>
>> I don't think anyone is going to be able to maintain and enforce a list of acceptable tools for contributors to the project to stick to. We can't know what someone did on their laptop, all we can do is evaluate the code they submit.
>>
>> I agree we might not be able to do a perfect job at any aspect of trying to make sure that the code we accept is not problematic in some way, but that doesn't mean we shouldn't try?
>>
>> We don't even necessarily need to be that restrictive beyond requiring tools that make at least some effort not to reproduce large amounts of copyrighted code that may/may not be license compatible, or tools that are themselves not license compatible. This ends up encompassing most of the ones people want to use anyways.
>>
>> How many people are aware that if you get code from OpenAI directly the license isn't ASL compatible, but that if you get it via Microsoft services that use OpenAI models it's ASL compatible? It's not in the ASF guidelines (it was, but they removed it!).
>>
>> How many people are aware that when people use locally run models there is no output filtering, further increasing the odds of the model reproducing copyright encumbered code?
>>
>> There is a non-zero amount we can do to educate and guide that would be better than pointing people to the ASF guidelines and leaving it at that.
>> The ASF guidelines themselves have suggestions like requiring people to say if they used AI, and then which AI. I don't think it's very useful beyond checking license compatibility of the AI itself, but that is something we should be doing, so it might as well be documented and included in the PR text.
>>
>> Ariel
>>
>> On Mon, Jun 2, 2025, at 7:54 PM, Jeremiah Jordan wrote:
>>
>> I don't think I said we should abdicate responsibility? I said the key point is that contributors, and more importantly reviewers and committers, understand the ASF guidelines and hold all code to those standards. Any suspect code should be blocked during review. As Roman says in your quote, this isn't about AI, it's about copyright. If someone submits copyrighted code to the project, whether an AI generated it or they just grabbed it from a Google search, it's on the project to try not to accept it.
>>
>> I don't think anyone is going to be able to maintain and enforce a list of acceptable tools for contributors to the project to stick to. We can't know what someone did on their laptop, all we can do is evaluate the code they submit.
>>
>> -Jeremiah
>>
>> On Mon, Jun 2, 2025 at 6:39 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi,
>>
>> As PMC members/committers we aren't supposed to abdicate this to legal or to contributors. Despite the fact that we aren't equipped to solve this problem, we are supposed to be making sure that code contributed is non-infringing.
>>
>> This is a quotation from Roman Shaposhnik from this legal thread: https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd
>>
>> Yes, because you have to. Again -- forget about AI -- if a drive-by contributor submits a patch that has huge amounts of code stolen from some existing copyright holder -- it is very much ON YOU as a committer/PMC to prevent that from happening.
>>
>> We aren't supposed to knowingly allow people to use AI tools that are known to generate infringing contributions or contributions which are not license compatible (such as OpenAI terms of use).
>>
>> Ariel
>>
>> On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote:
>>
>> > Ultimately it's the contributor's (and committer's) job to ensure that their contributions meet the bar for acceptance
>>
>> To me this is the key point. Given how pervasive this stuff is becoming, I don't think it's feasible to make some list of tools and enforce it. Even without getting into extra tools, IDEs (including IntelliJ) are doing more and more LLM based code suggestion as time goes on. I think we should point people to the ASF Guidelines around such tools, and the guidelines around copyrighted code, and then continue to review patches with the high standards we have always had in this project.
>>
>> -Jeremiah
>>
>> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi,
>>
>> To clarify, are you saying that we should not accept AI generated code until it has been looked at by a human and then written again with different "wording" to ensure that it doesn't directly copy anything?
>>
>> Or do you mean something else about the quality of "vibe coding" and how we shouldn't allow it because it makes bad code? Ultimately it's the contributor's (and committer's) job to ensure that their contributions meet the bar for acceptance, and I don't think we should tell them how to go about meeting that bar beyond what is needed to address the copyright concern.
>> I agree that the bar set by the Apache guidelines is pretty high. They are simultaneously impossible and trivial to meet depending on how you interpret them, and we are not very well equipped to interpret them.
>>
>> It would have been more straightforward for them to simply say no, but they didn't opt to do that, as if there is some way for PMCs to acceptably take AI generated contributions.
>>
>> Ariel
>>
>> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>>
>> fine tuning encourage not reproducing things verbatim
>>
>> I think not producing copyrighted output from your training data is a technically feasible achievement for these vendors so I have a moderate level of trust they will succeed at it if they say they do it.
>>
>> Some team members and I discussed this in the context of my documentation patch (which utilized Claude during composition). I conducted an experiment to pose high-level Cassandra-related questions to a model without additional context, while adjusting the temperature parameter (tested at 0.2, 0.5, and 0.8). The results revealed that each test generated content copied verbatim from a specific non-Apache (and non-DSE) website. I did not verify whether this content was copyrighted, though it was easily identifiable through a simple Google search. This occurred as a single sentence within the generated document, and as I am not a legal expert, I cannot determine whether this constitutes a significant issue.
>>
>> The complexity increases when considering models trained on different languages, which may translate content into English. In such cases, a Google search would fail to detect the origin. Is this still considered plagiarism? Does it violate copyright laws? I am uncertain.
>>
>> Similar challenges arise with code generation. For instance, if a model is trained on a GPL-licensed Python library that implements a novel data structure, and the model subsequently rewrites this structure in Java, a Google search is unlikely to identify the source.
>>
>> Personally, I do not assume these models will avoid producing copyrighted material. This doesn't mean I am against AI at all, but rather reflects my belief that the requirements set by Apache are not easily "provable" in such scenarios.
>>
>> My personal opinion is that we should at least consider allow listing a few specific sources (any vendor that scans output for infringement) and add that to the PR template and in other locations (readme, web site). Bonus points if we can set up code scanning (useful for non-AI contributions!).
>>
>> My perspective, after trying to see what AI can do, is the following:
>>
>> Strengths
>> * Generating a preliminary draft of a document and assisting with iterative revisions
>> * Documenting individual methods
>> * Generation of "simple" methods and scripts, provided the underlying libraries are well-documented in public repositories
>> * Managing repetitive or procedural tasks, such as "migrating from X to Y" or "converting serializations to the X interface"
>>
>> Limitations
>> * Producing a fully functional document in a single attempt that meets merge standards. When documenting Gens.java and Property.java, the output appeared plausible but contained frequent inaccuracies.
>> * Addressing complex or ambiguous scenarios ("gossip"), though this challenge is not unique to AI; Matt Byrd and I tested Claude for CASSANDRA-20659, where it could identify relevant code but proposed solutions that risked corrupting production clusters.
>> * Interpreting large-scale codebases. Beyond approximately 300 lines of actual code (excluding formatting), performance degrades significantly, leading to a marked decline in output quality.
>>
>> Note: When referring to AI/LLMs, I am not discussing interactions with a user interface to execute specific tasks, but rather leveraging code agents like Roo and Aider to provide contextual information to the LLM.
>>
>> Given these observations, it remains challenging to determine optimal practices. In some contexts it is very clear that nothing was taken from external work (e.g., "create a test using our BTree class that inserts a row with a null column," "analyze this function's purpose"). However, for substantial tasks, the situation becomes more complex. If the author employed AI as a collaborative tool during "pair programming," the concerns are not really that different from Google searches (unless the work involves unique elements like introducing new data structures or indexes). Conversely, if the author "vibe coded" the entire patch, two primary concerns arise: whether the author has rights to the code, and whether its quality aligns with requirements.
>>
>> TL;DR - I am not against AI contributions, but strongly prefer it's done as "pair programming". My experience with "vibe coding" makes me worry about the quality of the code, and that the author is less likely to validate that the code generated is safe to donate.
>>
>> This email was generated with the help of AI =)
>>
>> On May 30, 2025, at 3:00 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi all,
>>
>> It looks like we haven't discussed this much and haven't settled on a policy for what kinds of AI generated contributions we accept and what vetting is required for them.
>>
>> https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results

```
Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:

1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
2. At least one of the following conditions is met:
   2.1 The output is not copyrightable subject matter (and would not be even if produced by a human).
   2.2 No third party materials are included in the output.
   2.3 Any third party materials that are included in the output are being used with permission (e.g., under a compatible open-source license) of the third party copyright holders and in compliance with the applicable license terms.
3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about output that may be similar to training data, or from code scanning results.
```

>> There is a lot to unpack there, but it seems like any one of the conditions under 2 needs to be met, and 3 describes how 2.2 and 2.3 can be satisfied.
>>
>> 2.1 is tricky as we are not copyright lawyers, and 2.2 and 2.3 are a pretty high bar in that it's hard to know whether you have met them.
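>> Condition 3 points at code scanning. As a rough illustration only (a sketch, not a real scanner; the 25-token threshold and the local corpus directory are arbitrary assumptions), the most naive form of such a check just looks for long verbatim overlaps between a contribution and a corpus of known third party code:

```python
# Naive verbatim-overlap check: illustrative sketch only, not a real code
# scanning tool. Assumes a local directory of known third party sources to
# compare against; real scanners also handle tokenization, renaming,
# license detection, and so on.
import re
import sys
from pathlib import Path

SHINGLE_WORDS = 25  # flag runs of 25+ identical whitespace-separated tokens


def shingles(text: str, n: int = SHINGLE_WORDS) -> set[str]:
    # Normalize whitespace so formatting differences don't hide verbatim copies.
    tokens = re.findall(r"\S+", text)
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def suspicious_overlaps(patch_text: str, corpus_dir: str) -> list[tuple[str, str]]:
    # Return (corpus file, example overlapping run) pairs for any shared shingle.
    patch_shingles = shingles(patch_text)
    hits = []
    for path in Path(corpus_dir).rglob("*.java"):
        overlap = patch_shingles & shingles(path.read_text(errors="ignore"))
        if overlap:
            hits.append((str(path), next(iter(overlap))))
    return hits


if __name__ == "__main__":
    # Usage: python overlap_check.py <patch file> <corpus directory>
    patch = Path(sys.argv[1]).read_text(errors="ignore")
    for source, example in suspicious_overlaps(patch, sys.argv[2]):
        print(f"possible verbatim overlap with {source}: {example[:80]}...")
```

>> Anything a vendor or a dedicated scanning service does is far more sophisticated than this, but even a check this crude would catch the "large amounts of copyrighted code reproduced verbatim" case.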
>> Do we have anyone in the community running any code scanning tools already?
>>
>> Here is the JIRA for the addition of the generative AI policy: https://issues.apache.org/jira/browse/LEGAL-631
>> Legal mailing list discussion of the policy: https://lists.apache.org/thread/vw3jf4726yrhovg39mcz1y89mx8j4t8s
>> Legal mailing list discussion of compliant tools: https://lists.apache.org/thread/nzyl311q53xhpq99grf6l1h076lgzybr
>> Legal mailing list discussion about how OpenAI terms are not Apache compatible: https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16
>> Hadoop mailing list message hinting that they accept contributions but ask which tool: https://lists.apache.org/thread/bgs8x1f9ovrjmhg6b450bz8bt7o43yxj
>> Spark mailing list message where they have given up on stopping people: https://lists.apache.org/thread/h6621sxfxcnnpsoyr31x65z207kk80fr
>>
>> I didn't see other projects discussing and deciding how to handle these contributions, but I also only checked a few of them: Hadoop, Spark, Druid, and Pulsar. I also can't see their PMC mailing lists.
>>
>> I asked O3 to deep research what is done to avoid producing copyrighted code: https://chatgpt.com/share/683a2983-dd9c-8009-9a66-425012af840d
>>
>> To summarize: training data is deduplicated so the model is less likely to reproduce it verbatim, prompts and fine tuning encourage not reproducing things verbatim, the inference is biased to not pick the best option but some neighboring one, encouraging originality, and in some instances the output is checked to make sure it doesn't match the training data. So to some extent 2.2 is being done, to different degrees depending on what product you are using.
>>
>> It's worth noting that scanning the output can be probabilistic, in the case of say Anthropic, and they still recommend code scanning.
>>
>> Quite notably, Anthropic indemnifies its enterprise users against copyright claims. It's not perfect, but it does mean they have an incentive to make sure there are fewer copyright claims. We could choose to be picky and only accept specific sources of LLM generated code based on perceived safety.
>>
>> I think not producing copyrighted output from your training data is a technically feasible achievement for these vendors, so I have a moderate level of trust they will succeed at it if they say they do it.
>>
>> I could send a message to the legal list asking for clarification and a set of tools, but based on Roman's communication (https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd) I think this is kind of what we get. It's on us to ensure the contributions are kosher, either by code scanning or by accepting that the LLM vendors are doing a good job at avoiding copyrighted output.
>>
>> My personal opinion is that we should at least consider allow listing a few specific sources (any vendor that scans output for infringement) and add that to the PR template and in other locations (readme, web site). Bonus points if we can set up code scanning (useful for non-AI contributions!).
>>
>> Regards,
>> Ariel
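>> P.S. If we do add this to the PR template, the disclosure section could be as small as something like the following (the wording is only a sketch, nothing here is settled):

```
### Generative AI tooling

- [ ] No generative AI tools were used to produce this contribution, OR
- [ ] The following tools were used: <tool(s) and version(s)>
- [ ] The tool(s) used are on the project's allow list (or an addition has been proposed on the dev list)
- [ ] I have reviewed the generated portions and am not aware of any third party code reproduced without a compatible license
```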