Re: shared tools to validate convenience binaries and artifacts

Jarek Potiuk Sat, 22 Nov 2025 08:59:41 -0800

> I have used diff -r in the past, and
As mentioned above - the only reliable check on artifact generation (whether
those are source tarballs, or any other kind of artifacts that are the result
of "convenience/binary" artifact preparation) is reproducibility. Many of
the transformations are one-way, so the results cannot easily be compared to
the sources. It is similar to asymmetric-key cryptography, where recovering
the two prime numbers from their product is hard: it's way easier to prove
that the result is reproducible by running the same transformations and
comparing the results. That's pure math.
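
A minimal sketch of that "rebuild and compare" step, assuming the verifier has
already rebuilt the artifact locally. The function names are hypothetical and
are not part of ATR or any other ASF tool; the sketch only compares SHA-256
digests of the voted-on file and the locally rebuilt one:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproduced(candidate, rebuilt):
    """True when the voted-on artifact and the local rebuild are bit-identical."""
    return sha256_of(candidate) == sha256_of(rebuilt)
```

Equal digests show the rebuild is bit-for-bit identical. A mismatch says
nothing about where the difference lies, so it is only the first step.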


> some editors can show recursive directory differences.

The reproducible-builds.org project produced a fantastic tool for precisely
this - comparing two artifacts and reporting the differences in a way that is
easy to reason about, even if the artifacts are not bit-for-bit reproducible.
It's called diffoscope; it's super easy to install and run, and it produces
all kinds of output - both human- and machine-readable.

https://diffoscope.org/
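
For the narrower check raised earlier in the thread (every file in the source
tarball should be derivable from the tag), a rough standard-library sketch
follows. This is not diffoscope and does not use its API. The function name,
the `strip_components` parameter, and the assumption that the tarball wraps
everything in a single top-level directory are illustrative only:

```python
import tarfile
from pathlib import Path


def compare_tarball_to_tree(tarball, tree, strip_components=1):
    """Compare regular files in a tarball against a checked-out source tree.

    Returns (extra, changed): paths present only in the tarball, and paths
    whose bytes differ from the file in the tree.
    """
    tree = Path(tree)
    extra, changed = [], []
    with tarfile.open(tarball) as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, symlinks and other special members
            # Drop the leading wrapper directory (e.g. "pkg-1.0/").
            parts = Path(member.name).parts[strip_components:]
            if not parts:
                continue
            relative = Path(*parts)
            on_disk = tree / relative
            if not on_disk.is_file():
                extra.append(relative.as_posix())
            elif archive.extractfile(member).read() != on_disk.read_bytes():
                changed.append(relative.as_posix())
    return sorted(extra), sorted(changed)
```

Files listed as "changed" are not necessarily a problem, since release
preparation may legitimately transform them; files listed as "extra" are the
ones worth a manual look.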




On Sat, Nov 22, 2025 at 5:34 PM Jarek Potiuk <[email protected]> wrote:

> > does the source code in the tarball match what is announced as the git
> commit. If there is a pre-existing tool that does that check, I'd love to
> use it.
>
> Actually this is something that only reproducibility checks can reliably
> tell. Usually during preparation of those releases some transformations are
> done (compiling stuff, transpiling, generating metadata and so on) so the
> only way you can actually verify it is reproducibility.  Someone (PMC
> member) who is verifying  the release should be able to prepare the same
> release and compare that they are the same (ideally reproducible
> bit-by-bit) - we discussed a lot about it in ATR slack and security-discuss
> and this is rather something that each project will have to do on their own
> (we do in Airflow) - i.e. to have instructions on how to verify the release
> - one of those steps is "please recreate the package and check if it is
> the same as the one you are voting on". This is something that ATR might
> make easier, and eventually run its own "rebuild and check" - but the
> safest way is to make your artifacts reproducible and give your PMC
> members instructions for it - it's just describing "How do I produce
> PMC-reproducible builds".
>
> We have this nice page
> https://cwiki.apache.org/confluence/display/SECURITY/Reproducible+Builds
> - where we gather best ASF practices for reproducibility - there is also a
> link to "reproducible-builds.org" https://reproducible-builds.org/docs/
> that has a wealth of information and recipes for various languages. I was at
> the Vienna Reproducible Builds Summit 3 weeks ago and I think we are
> getting close to having reproducibility practices spread through the
> ecosystem - also reproducible build is one of the conditions that will
> allow you to fully automate ATR builds from CI. ATR has a CLI, APIs and
> GitHub Actions that will allow you to do all kinds of things - publish
> your artifacts to ATR, start voting, but also submit your artifacts to PyPI
> and NPM via Trusted Publishing automatically from your release workflows in
> CI - but this has one specific condition: your builds will have to be
> reproducible and your PMC members when voting will have to confirm that the
> artifacts produced automatically are reproducible by them.
>
> Using ATR will allow us (we already do) to use the ASF data - OID for
> attestation, signing, and publishing to 3rd-party registries, but also
> access to trusted committer and PMC database to know who is doing what
> (like binding/non-binding votes) etc. etc. And yes I think very soon we
> (ASF) will be adding, documenting and implementing more and more common
> practices around release artifacts preparation - both procedural and
> technological (more cryptographic attestations, storing information about
> build environment in a cryptographically secure way and producing
> cryptographically verifiable attestations that 3rd-parties will be able to
> store on ledgers and other 3rd-parties will be able to independently
> verify).
>
> All this is currently very actively discussed and being implemented - in
> "trusted-releases" and "security-discuss" mailing lists and slack channels.
> So I would love to bring everyone's attention to the fact that those
> discussions should likely happen there, because it's very likely (if not
> certain, due to its "board commissioned it with the tooling team and funded
> the work" status).
>
> > One extra point that is worth mentioning. On several occasions, I’ve
> seen automation give a false sense of security. A tool reports everything
> as clean, and people assume the release is fine when it is not. It’s only
> when humans look deeper that a serious issue is discovered. For example, a
> mention of a GPL license can be fine, depending on the context, and
> automation is unlikely to detect it.
>
> Absolutely. 100% agree and this is something we usually keep on
> discovering every now and then. This should **never** be removed from the
> picture. Even now we have two independent licence checks in ATR - one with
> RAT and one custom written by Sean, and the side effect is that they do
> **currently** sometimes detect different licensing issues. And I am sure
> one of the things in RAT we will do is asking for (and having the PMC
> perform) an occasional "manual" verification to periodically check things
> by hand. One interesting point is that I think **both** should be
> happening and we need to figure out how to build the automation in a way
> that we either remove or actively "counteract" the "false sense of
> confidence". There are many ways this can be done, for example by injecting
> deliberate errors in the process or by automating reminders (super-simple
> thing - Apple keeps on reminding me to manually verify my phone number
> every few months, just in case I changed it). Having a single, centralised
> release tool gives us the opportunity of iterating and improving on the
> process, and will give us (the ASF) a way to have a step in the process
> where we will be able to "inject" all kinds of behaviour-changing processes
> and experiment with them. I think we will finally have a chance to not only
> tell our PMCs (and PPMCs) how to do the releases, but also more actively
> monitor it and - more importantly - influence it in a way that is far more
> efficient and enforceable.
>
> J.
>
>
>
> On Sat, Nov 22, 2025 at 5:07 PM Jarek Potiuk <[email protected]> wrote:
>
>> I think even if ATR does not **currently** support more checks than
>> **basic** checks for binary releases, there is absolutely nothing wrong in
>> adding them there. ATR will (hopefully) be one of the most commonly used
>> tools in the ASF, and we have a tooling team that supports its development
>> and maintenance. Also, all the code is super-easy Python code using modern
>> standards, with uv to run the tooling, and if anyone would like to
>> contribute a check for certain artifact types - like PyPI RCs - I am 100%
>> sure Sean and Dava and others who are already contributing and adding
>> issues and tools will be super happy to accept.
>>
>> What my post was mostly meant to suggest is that very soon we will have a
>> common "platform" for release verification - we (ASF) already do basic
>> checks with ATR on our binary artifacts, we already use RAT from creadur
>> mentioned above for licence checking and there is **absolutely no reason**
>> anyone here could not add a new check there - I am sure contributions will
>> be very welcome there. My cooperation with the tooling team has been
>> nothing-but-stellar.
>>
>> So my main point is that if there are ideas on how to improve this "common
>> platform" we are going to have, which is already plugged into our release
>> process - they are absolutely welcome, but ideally they should be added to
>> ATR, rather than developed separately. They could also be - of course -
>> developed separately in creadur (like RAT is) and used in ATR, but I think
>> having those checks integrated with ATR all but guarantees that they are
>> going to be useful across the whole ASF.
>>
>> That's all I wanted to stress. I sensed a somewhat defensive reaction when I
>> mentioned ATR, but that was more "Hey - we have this great platform for
>> releases which is already funded by Alpha-Omega, and driven by board
>> decision, so we should rather work on strengthening it and adding things to
>> something that is **precisely** targeted at automating the workflow that has
>> been mentioned here as one that **is in need of automation**".
>>
>> Yes, it is, and we have an ASF-wide effort to improve exactly that
>> workflow - an effort that the board has not only recognised, funded and
>> staffed, but that (in a recent conversation with some board members) has
>> also been named the absolute game-changer for the ASF (which I 100% agree
>> with).
>>
>> So ... let's do it as a combined effort - as simple as that :) .
>>
>> J.
>>
>>
>>
>> On Sat, Nov 22, 2025 at 3:20 PM sebb <[email protected]> wrote:
>>
>>> On Sat, 22 Nov 2025 at 14:03, PJ Fanning <[email protected]> wrote:
>>> >
>>> > My issue is not really about the source release and there is some
>>> > tooling and typically the review checks are to be done at vote time.
>>> > Here is a check that might be useful to automate and that can't be
>>> > properly done without it - does the source code in the tarball match
>>> > what is announced as the git commit. If there is a pre-existing tool
>>> > that does that check, I'd love to use it.
>>>
>>> I agree that this is vital, as the tarballs are generally created from
>>> whatever happens to be in the source directories.
>>> It's very easy for spurious files to be added to the tarball, e.g.
>>> files left over from testing.
>>> An exact match is not necessary, so long as every file in the source
>>> tarball can be derived from the source tag.
>>>
>>> I have used diff -r in the past, and some editors can show recursive
>>> directory differences.
>>>
>>> > My issue is really with the convenience binaries. Are reviewers really
>>> > unzipping jar files to check the contents and checking the text in the
>>> > pom files?
>>> >
>>> > What format are the pypi RCs supposed to be in? Are we sure that the
>>> > apache prefix appears in the target pypi project?
>>> >
>>> > And the big binary tarballs that some teams ship, full of jars or
>>> > other compiled components? Those can be a real time consumer to
>>> > manually review.
>>> >
>>> > Some reviewers do these convenience binary checks and maybe it's my
>>> > bad luck to try checking on votes but I see a lot of issues when I
>>> > review convenience binaries.
>>> >
>>> >
>>> >
>>> > On Sat, 22 Nov 2025 at 14:49, tison <[email protected]> wrote:
>>> > >
>>> > > > a mention of a GPL license can be fine
>>> > >
>>> > > Typically, you'd end up with an allow list, like [1][2]
>>> > >
>>> > > [1]
>>> https://github.com/apache/flink/blob/d0c9ed9ff47cd0f0fae62958521a0b18e5cd9bf3/tools/ci/flink-ci-tools/src/main/java/org/apache/flink/tools/ci/licensecheck/JarFileChecker.java#L194-L260
>>> > > [2]
>>> https://github.com/apache/opendal/blob/c35da0d92442756d5742eaf70a2259dd23621b53/deny.toml#L28-L48
>>> > >
>>> > > Best,
>>> > > tison.
>>> > >
>>> > > On Sat, Nov 22, 2025 at 21:44, <[email protected]> wrote:
>>> > > >
>>> > > > Hi,
>>> > > >
>>> > > > One extra point that is worth mentioning. On several occasions,
>>> I’ve seen automation give a false sense of security. A tool reports
>>> everything as clean, and people assume the release is fine when it is not.
>>> It’s only when humans look deeper that a serious issue is discovered. For
>>> example, a mention of a GPL license can be fine, depending on the context,
>>> and automation is unlikely to detect it.
>>> > > >
>>> > > > Kind Regards.
>>> > > >
>>> > > > Justin
>>> > >
>>> > > ---------------------------------------------------------------------
>>> > > To unsubscribe, e-mail: [email protected]
>>> > > For additional commands, e-mail: [email protected]
>>> > >
>>> >
>>> >
>>>
>>>
>>>
