Does "*optionally* disclose the LLM used in whatever way you prefer and *definitely* no OpenAI" meet everyone's expectations?
- Yifan

On Thu, Jul 31, 2025 at 1:56 PM Josh McKenzie <jmcken...@apache.org> wrote:

Do we have a consensus on this topic, or is there still further discussion to be had?

On Thu, Jul 24, 2025, at 8:26 AM, David Capwell wrote:

> Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:
>
> 1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
> 2. At least one of the following conditions is met:
>    2.1. The output is not copyrightable subject matter (and would not be even if produced by a human).
>    2.2. No third party materials are included in the output.
>    2.3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open-source license) of the third party copyright holders and in compliance with the applicable license terms.
> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about output that may be similar to training data, or from code scanning results.
>
> ASF Generative Tooling Guidance
> <https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results>

Ariel shared this at the start. Right now we must know what tool was used so we can make sure its license is OK. The only tool currently flagged as not acceptable is OpenAI, as its terms include wording limiting what you may do with its output.

On Jul 23, 2025, at 1:31 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:

+1 to Patrick's proposal.

On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <pmcfa...@gmail.com> wrote:

I just did some review of the case law around copyright and AI code. So far, every claim has been dismissed. There are some other cases, like the NYTimes one, which have more merit and are proceeding.

Which leads me to the opinion that this is feeling like a premature optimization. Somebody creating a PR should not have to also submit an SBOM, which is essentially what we're asking. It's undue burden and friction on the process when we should be looking for ways to reduce friction.

My proposal is no disclosures required.

On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:

According to the thread, the disclosure is for legal purposes, for example, to show that the patch was not produced by OpenAI's service. I think having the discussion to clarify AI usage in the project is meaningful. I guess many are hesitating because of the lack of clarity in this area.

> I don't believe or agree with us assuming we should do this for every PR

I am with you, David. Updating the mailing list for every PR is overwhelming for both the author and the community. I also do not feel co-author is the best place.

- Yifan

On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com> wrote:

This is starting to get ridiculous. Disclosure statements on exactly how a problem was solved? What's next? Time cards?

It's time to accept the world as it is. AI is in the coding toolbox now, just like IDEs, linters, and code formatters. Some may not like using them, some may love using them. What matters is that a problem was solved and that the code matches whatever quality standard the project upholds, which should be enforced by testing and code reviews.

Patrick

On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com> wrote:

> David is disclosing it in the mailing list and the GH page. Should the disclosure be persisted in the commit?

Someone asked me to update the ML, but I don't believe or agree with us assuming we should do this for every PR; personally, storing this in the PR description is fine to me, as you are telling the reviewers (who you need to communicate this to).

> I'd say we can use the co-authored part of our commit messages to disclose the actual AI that was used?

Heh... I kinda feel dirty doing that... No one does that when they take something from a blog or Stack Overflow, but when you do that you should still attribute by linking... which I guess is what Co-Authored does?

I don't know... feels dirty...

On Jul 23, 2025, at 11:19 AM, Bernardo Botella <conta...@bernardobotella.com> wrote:

That's a great point. I'd say we can use the co-authored part of our commit messages to disclose the actual AI that was used?

On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:

Curious, what are the good ways to disclose the information?

> All of which comes back to: if people disclose if they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as a scaffolding and then heavily modified everything I think we'll be in a pretty good spot.

David is disclosing it in the mailing list and the GH page. Should the disclosure be persisted in the commit?

- Yifan

On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com> wrote:

Sent out this patch that was written 100% by Claude: https://github.com/apache/cassandra/pull/4266

Claude's license doesn't have issues with the current ASF policy as far as I can tell. If you look at the patch, it's very clear there isn't any copyrighted material (it's gluing together C* classes).

I could have written this myself, but I had to focus on code reviews and also needed this patch out, so I asked Claude to write it for me so I could focus on reviews. I have reviewed it myself and it's basically the same code I would have written (notice how small and focused the patch is; larger stuff doesn't normally pass my peer review).

On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:

+1 to what Josh said

On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:

Did some more digging. Apparently the way a lot of headline-grabbers have been making models reproduce code verbatim is to prompt them with dozens of verbatim tokens of copyrighted code as input, where the completion is then very heavily weighted toward regurgitating the initial implementation. Which makes sense; if you copy/paste 100 lines of copyrighted code, the statistically likely completion for that will be that initial implementation.

For local LLMs, verbatim reproduction is unlikely for a different reason, but apparently to a comparable degree: they have far fewer parameters (32B vs. 671B for DeepSeek, for instance) relative to a pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B, for instance), so the individual tokens from the copyrighted material are highly unlikely to actually be *stored* in the model to be reproduced, and certainly not in sequence. They don't have the post-generation checks claimed by the SOTA models, but are apparently considered to be in the "< 1 in 10,000 completions will generate copyrighted code" territory.

When given a human-language prompt, or a multi-agent pipelined "still human language but from your architect agent" prompt, the likelihood of producing a string of copyrighted code in that manner is statistically very, very low. I think we're at far more risk from contributors copy/pasting Stack Overflow or code from other projects than we are from modern genAI models producing blocks of copyrighted code.

All of which comes back to: if people disclose whether they used AI, which models, and whether they used the code or text the model wrote verbatim or used it as scaffolding and then heavily modified everything, I think we'll be in a pretty good spot.

On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:

> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>
> 2 would basically prohibit locally run models.

I am not for this, for the reasons listed above. There isn't a difference between this and a contributor copying code and sending it our way. We still need to validate that the code can be accepted.

We also have the issue of this being a broad stroke. If a user asked a model to write a test for code the human wrote, do we reject the contribution because they used a local model? This poses very little copyright risk, yet our policy would now reject it.

On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:

2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?

2 would basically prohibit locally run models.
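For reference, the "co-authored" disclosure Bernardo and David discuss above would just be a standard Git trailer on the commit message. A minimal sketch follows; the ticket number, wording, and the attribution name/email used for the tool are illustrative assumptions, not an agreed project convention:

    CASSANDRA-XXXXX: <summary of the change>

    Initial patch drafted with the assistance of Claude; reviewed, edited,
    and tested by the committer.

    Co-authored-by: Claude <noreply@anthropic.com>

The Co-authored-by trailer is the same convention GitHub already recognizes for human co-authors, so a disclosure recorded this way persists in the commit history without adding extra steps to the PR process.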