Does "*optionally* disclose the LLM used in whatever way you prefer and
*definitely* no OpenAI" meet everyone's expectations?

- Yifan

On Thu, Jul 31, 2025 at 1:56 PM Josh McKenzie <jmcken...@apache.org> wrote:

> Do we have a consensus on this topic or is there still further discussion
> to be had?
>
> On Thu, Jul 24, 2025, at 8:26 AM, David Capwell wrote:
>
> Given the above, code generated in whole or in part using AI can be
> contributed if the contributor ensures that:
> 1. The terms and conditions of the generative AI tool do not place any
> restrictions on use of the output that would be inconsistent with the Open
> Source Definition.
> 2. At least one of the following conditions is met:
>   2.1. The output is not copyrightable subject matter (and would not be
> even if produced by a human).
>   2.2. No third party materials are included in the output.
>   2.3. Any third party materials that are included in the output are being
> used with permission (e.g., under a compatible open-source license) of the
> third party copyright holders and in compliance with the applicable
> license terms.
> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3
> are met if the AI tool itself provides sufficient information about output
> that may be similar to training data, or from code scanning results.
> ASF Generative Tooling Guidance
> <https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results>
>
>
> Ariel shared this at the start. Right now we must know what tool was used
> so we can make sure its license is OK. The only tool currently flagged as
> not acceptable is OpenAI, as it has wording limiting what you may do with
> its output.
>
> Sent from my iPhone
>
> On Jul 23, 2025, at 1:31 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> +1 to Patrick's proposal.
>
> On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <pmcfa...@gmail.com>
> wrote:
>
> I just did some review of the case law around copyright and AI code.
> So far, every claim has been dismissed. There are some other cases, like
> NYTimes, which have more merit and are proceeding.
>
> Which leads me to the opinion that this is feeling like a premature
> optimization. Somebody creating a PR should not have to also submit an
> SBOM, which is essentially what we’re asking. It’s undue burden and
> friction on the process when we should be looking for ways to reduce
> friction.
>
> My proposal is no disclosures required.
>
> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:
>
> According to the thread, the disclosure is for legal purposes. For
> example, that the patch was not produced by OpenAI's service. I think
> having the discussion to clarify AI usage in the project is meaningful. I
> guess many are hesitating because of the lack of clarity in this area.
>
> > I don’t believe or agree with us assuming we should do this for every PR
>
> I am with you, David. Updating the mailing list for PRs is overwhelming
> for both the author and the community.
>
> I also do not feel the co-author field is the best place.
>
> - Yifan
>
> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com>
> wrote:
>
> This is starting to get ridiculous. Disclosure statements on exactly how a
> problem was solved? What’s next? Time cards?
>
> It’s time to accept the world as it is. AI is in the coding toolbox now,
> just like IDEs, linters, and code formatters. Some may not like using
> them, some may love using them. What matters is that a problem was solved
> and that the code matches whatever quality standard the project upholds,
> which should be enforced by testing and code reviews.
>
> Patrick
>
> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com> wrote:
>
> David is disclosing it in the mailing list and the GH page. Should the
> disclosure be persisted in the commit?
>
>
> Someone asked me to update the ML, but I don’t believe or agree with us
> assuming we should do this for every PR; personally, storing this in the
> PR description is fine with me, as you are telling the reviewers (who are
> the people you need to communicate this to).
>
>
> I’d say we can use the co-authored part of our commit messages to disclose
> the actual AI that was used?
>
>
> Heh... I kinda feel dirty doing that… No one does that when they take
> something from a blog or Stack Overflow, but when you do that you should
> still attribute by linking… which I guess is what Co-Authored does?
>
> I don’t know… feels dirty...
>
>
> On Jul 23, 2025, at 11:19 AM, Bernardo Botella <
> conta...@bernardobotella.com> wrote:
>
> That’s a great point. I’d say we can use the co-authored part of our
> commit messages to disclose the actual AI that was used?
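>
> As a rough illustration only (the ticket number, wording, and trailer
> value below are placeholders, not an agreed convention), a commit message
> using that approach might look like:
>
>     CASSANDRA-XXXXX: Example change
>
>     Patch drafted with assistance from an LLM; reviewed and tested by the
>     committer before submission.
>
>     Co-authored-by: Claude <noreply@anthropic.com>
>
> Co-authored-by is just a commit trailer (a GitHub convention for crediting
> additional authors), so it can carry a tool name, though we would still
> need to agree on the exact wording.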
>
>
>
> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>
> Curious, what are the good ways to disclose the information?
>
> > All of which comes back to: if people disclose whether they used AI, what
> > models, and whether they used the code or text the model wrote verbatim or
> > used it as scaffolding and then heavily modified everything, I think we'll
> > be in a pretty good spot.
>
> David is disclosing it in the mailing list and the GH page. Should the
> disclosure be persisted in the commit?
>
> - Yifan
>
> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com> wrote:
>
> Sent out this patch that was written 100% by Claude:
> https://github.com/apache/cassandra/pull/4266
>
> Claude's license doesn’t have issues with the current ASF policy as far as
> I can tell. If you look at the patch, it’s very clear there isn’t any
> copyrighted material (it’s gluing together C* classes).
>
> I could have written this myself, but I had to focus on code reviews and
> also needed this patch out, so I asked Claude to write it for me so I
> could focus on reviews. I have reviewed it myself and it’s basically the
> same code I would have written (notice how small and focused the patch is;
> larger stuff doesn’t normally pass my peer review).
>
> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>
> +1 to what Josh said
> Sent from my iPhone
>
> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>
> Did some more digging. Apparently the way a lot of headline-grabbers have
> been making models reproduce code verbatim is to prompt them with dozens
> of verbatim tokens of copyrighted code as input, where the completion is
> then very heavily weighted to regurgitate the initial implementation.
> Which makes sense; if you copy/paste 100 lines of copyrighted code, the
> statistically likely completion for that will be that initial
> implementation.
>
> For local LLMs, the likelihood of verbatim reproduction is *differently*
> but apparently comparably unlikely: they have far fewer parameters (32B
> vs. 671B for DeepSeek, for instance) relative to a pre-training corpus of
> trillions of tokens (30T in the case of Qwen3-32B, for instance), so the
> individual tokens from the copyrighted material are highly unlikely to be
> actually *stored* in the model to be reproduced, and certainly not in
> sequence. They don't have the post-generation checks claimed by the SOTA
> models, but are apparently considered to be in the "< 1 in 10,000
> completions will generate copyrighted code" territory.
>
> When given a human-language prompt, or a multi-agent pipelined "still
> human language but from your architect agent" prompt, the likelihood of
> producing a string of copyrighted code in that manner is statistically
> very, very low. I think we're at far more risk of contributors
> copy/pasting Stack Overflow or code from other projects than we are from
> modern genAI models producing blocks of copyrighted code.
>
> All of which comes back to: if people disclose whether they used AI, what
> models, and whether they used the code or text the model wrote verbatim or
> used it as scaffolding and then heavily modified everything, I think we'll
> be in a pretty good spot.
>
> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>
>
> 2. Models that do not do output filtering to restrict the reproduction of
> training data unless the tool can ensure the output is license compatible?
>
> 2 would basically prohibit locally run models.
>
>
> I am not for this, for the reasons listed above. There isn’t a difference
> between this and a contributor copying code and sending it our way. We
> still need to validate that the code can be accepted.
>
> We also have the issue of this being a broad stroke. If the user asked a
> model to write a test for code the human wrote, do we reject the
> contribution because they used a local model? This poses very little
> copyright risk, yet our policy would now reject it.
>
> Sent from my iPhone
>
> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> 2. Models that do not do output filtering to restrict the reproduction of
> training data unless the tool can ensure the output is license compatible?
>
> 2 would basically prohibit locally run models.
>
>
>
>
