As mentioned above - the only way to verify artifact generation (whether
those are source tarballs, or any other kind of artifacts that result
from "convenience/binary" artifact preparation) is reproducibility. Many
of the transformations are one-way, so the results cannot easily be
compared to the sources. It's similar to asymmetric keys in cryptography,
where you would have to find two prime numbers from their product - it's
way easier to prove that the result is reproducible by running the same
transformations and comparing the results. That's pure math.

> I have used diff -r in the past, and some editors can show recursive
> directory differences.

The "reproducible-builds.org" project produced a fantastic tool for
precisely this - comparing two artifacts and reporting the differences in
a way that is easy to reason about, even if they are not bit-for-bit
reproducible. It's called diffoscope; it's super easy to install and run,
and it produces all kinds of output - both human- and machine-readable.

https://diffoscope.org/

On Sat, Nov 22, 2025 at 5:34 PM Jarek Potiuk <[email protected]> wrote:

> > does the source code in the tarball match what is announced as the git
> > commit. If there is a pre-existing tool that does that check, I'd love
> > to use it.
>
> Actually this is something that only reproducibility checks can reliably
> tell. Usually during preparation of those releases some transformations
> are done (compiling stuff, transpiling, generating metadata and so on) so
> the only way you can actually verify it is reproducibility. Someone (a PMC
> member) who is verifying the release should be able to prepare the same
> release and check that the two are the same (ideally reproducible
> bit-by-bit) - we discussed this a lot in ATR slack and security-discuss,
> and this is rather something that each project will have to do on their
> own (we do in Airflow) - i.e. to have instructions on how to verify the
> release - one of those steps is "please recreate the package and check if
> it is the same as the one you are voting on". This is something that ATR
> might make easier, and run their own "rebuild and check" eventually - but
> the safest way is to make your artifacts reproducible and give
> instructions to your PMC members - and it's just describing "How do I
> produce PMC reproducible builds".
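
A minimal sketch of the "recreate the package and check if it is the same" step, assuming the voter already has the voted artifact and a locally rebuilt copy on disk. The function names `sha512_of` and `is_reproduced` are hypothetical and are not part of ATR or any other ASF tooling; the sketch only shows the bit-for-bit comparison itself.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large release tarballs do not
        # have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproduced(voted_artifact, rebuilt_artifact):
    """True when the rebuilt artifact is bit-for-bit identical to the voted one."""
    return sha512_of(voted_artifact) == sha512_of(rebuilt_artifact)
```

If the digests differ, running `diffoscope` on the two files is one way to see where they diverge.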
>
> We have this nice page
> https://cwiki.apache.org/confluence/display/SECURITY/Reproducible+Builds
> - where we gather best ASF practices for reproducibility - also there is
> a link to "reproducible-builds.org" https://reproducible-builds.org/docs/
> that has a wealth of information and recipes for various languages. I was
> at the Vienna Reproducible Builds Summit 3 weeks ago and I think we are
> getting close to having reproducibility practices spread through the
> ecosystem - also a reproducible build is one of the conditions that will
> allow you to fully automate ATR builds from CI. ATR has a CLI, APIs and
> GitHub Actions that will allow you to do all kinds of things - publish
> your artifacts to ATR and start voting, but also submit your artifacts to
> PyPI and NPM via Trusted Publishing automatically from your release
> workflows in CI - but this has one specific condition: your builds will
> have to be reproducible, and your PMC members, when voting, will have to
> confirm that the artifacts produced automatically are reproducible by
> them.
>
> Using ATR will allow us (we already do) to use the ASF data - OID for
> attestation, signing, and publishing to 3rd-party registries, but also
> access to the trusted committer and PMC database to know who is doing
> what (like binding/non-binding votes) etc. And yes, I think very soon we
> (ASF) will be adding, documenting and implementing more and more common
> practices around release artifact preparation - both procedural and
> technological (more cryptographic attestations, storing information about
> the build environment in a cryptographically secure way and producing
> cryptographically verifiable attestations that 3rd-parties will be able to
> store on ledgers and other 3rd-parties will be able to independently
> verify).
>
> All this is currently very actively discussed and being implemented - in
> the "trusted-releases" and "security-discuss" mailing lists and slack
> channels.
> So I would love to draw everyone's attention to the fact that those
> discussions should likely happen there, because it's very likely (if not
> certain due to its "board commissioned it with the tooling team and funded
> the work" status.
>
> > One extra point that is worth mentioning. On several occasions, I’ve
> > seen automation give a false sense of security. A tool reports
> > everything as clean, and people assume the release is fine when it is
> > not. It’s only when humans look deeper that a serious issue is
> > discovered. For example, a mention of a GPL license can be fine,
> > depending on the context, and automation is unlikely to detect it.
>
> Absolutely. 100% agree and this is something we usually keep on
> discovering every now and then. This should **never** be removed from the
> picture. Even now we have two independent licence checks in ATR - one with
> RAT and one custom written by Sean, and the side effect is that they do
> **currently** sometimes detect different licensing issues. And I am sure
> one of the things we will do in RAT is to ask for (and have the PMC
> perform) an occasional "manual" verification to periodically check things
> by hand. One interesting point is that I think **both** should be
> happening and we need to figure out how to build the automation in a way
> that either removes or actively "counteracts" the "false sense of
> confidence". There are many ways this can be done, for example by
> injecting deliberate errors in the process or by automating reminders (a
> super-simple thing - Apple keeps on reminding me to manually verify my
> phone number every few months, just in case I changed it). Having a
> single, centralised release tool gives us the opportunity to iterate and
> improve on the process, and will give us (the ASF) a step in the process
> where we will be able to "inject" all kinds of behaviour-changing
> processes and experiment with them.
> I think we will finally have a chance to not only tell our PMCs (and
> PPMCs) how to do the releases, but also to monitor them more actively and
> - more importantly - influence them far more efficiently and enforceably.
>
> J.
>
> On Sat, Nov 22, 2025 at 5:07 PM Jarek Potiuk <[email protected]> wrote:
>
>> I think even if ATR does not **currently** support more checks than
>> **basic** checks for binary releases, there is absolutely nothing wrong
>> in adding them there. ATR will (hopefully) be one of the most commonly
>> used tools in the ASF, and we have a tooling team that supports its
>> development and maintenance; also all the code is super-easy Python code
>> using modern standards, with uv to run the tooling, and if anyone would
>> like to contribute a check for certain artifact types - like PyPI RCs -
>> I am 100% sure Sean and Dava and others who are already contributing and
>> adding issues and tools will be super happy to accept it.
>>
>> What my post was mostly meant to suggest is that very soon we will have
>> a common "platform" for release verification - we (ASF) already do basic
>> checks with ATR on our binary artifacts, we already use RAT from creadur
>> mentioned above for licence checking and there is **absolutely no
>> reason** anyone here could not add a new check there - I am sure
>> contributions will be very welcome there. My cooperation with the
>> tooling team has been nothing-but-stellar.
>>
>> So my main point is that if there are ideas on how to improve this
>> "common platform" we are going to have, which is already plugging into
>> our release process - they are absolutely welcome, but ideally they
>> should be added to ATR, rather than developed separately. It could also
>> be - of course - developed separately in creadur (like RAT is) and used
>> in ATR, but I think having those checks integrated with ATR all but
>> guarantees that they are going to be useful across the whole ASF.
>>
>> That's all I wanted to stress.
>> I sensed a somewhat defensive reaction when I mentioned ATR, but my
>> point was more: "Hey - we have this great platform for releases which is
>> already funded by Alpha-Omega and driven by a board decision, so we
>> should rather work on strengthening it and adding things to something
>> that **precisely** targets automating the workflow that has been
>> mentioned here as one that **is in need of automation**".
>>
>> Yes, it is, and we have an ASF-wide effort to improve exactly that
>> workflow, which the board has not only recognised, secured funds for and
>> staffed, but which has also (in a recent conversation with some board
>> members) been named as the absolute game-changer for the ASF (which I
>> 100% agree with).
>>
>> So ... let's do it as a combined effort - as simple as that :) .
>>
>> J.
>>
>> On Sat, Nov 22, 2025 at 3:20 PM sebb <[email protected]> wrote:
>>
>>> On Sat, 22 Nov 2025 at 14:03, PJ Fanning <[email protected]> wrote:
>>> >
>>> > My issue is not really about the source release and there is some
>>> > tooling and typically the review checks are to be done at vote time.
>>> > Here is a check that might be useful to automate and that can't be
>>> > properly done without it - does the source code in the tarball match
>>> > what is announced as the git commit. If there is a pre-existing tool
>>> > that does that check, I'd love to use it.
>>>
>>> I agree that this is vital, as the tarballs are generally created from
>>> whatever happens to be in the source directories.
>>> It's very easy for spurious files to be added to the tarball, e.g.
>>> files left over from testing.
>>> An exact match is not necessary, so long as every file in the source
>>> tarball can be derived from the source tag.
>>>
>>> I have used diff -r in the past, and some editors can show recursive
>>> directory differences.
>>>
>>> > My issue is really with the convenience binaries.
>>> > Are reviewers really unzipping jar files to check the contents and
>>> > checking the text in the pom files?
>>> >
>>> > What format are the pypi RCs supposed to be in? Are we sure that the
>>> > apache prefix appears in the target pypi project?
>>> >
>>> > And the big binary tarballs that some teams ship, full of jars or
>>> > other compiled components? Those can be a real time consumer to
>>> > manually review.
>>> >
>>> > Some reviewers do these convenience binary checks and maybe it's my
>>> > bad luck to try checking on votes but I see a lot of issues when I
>>> > review convenience binaries.
>>> >
>>> > On Sat, 22 Nov 2025 at 14:49, tison <[email protected]> wrote:
>>> > >
>>> > > > a mention of a GPL license can be fine
>>> > >
>>> > > Typically, you'd end up with an allow list, like [1][2]
>>> > >
>>> > > [1] https://github.com/apache/flink/blob/d0c9ed9ff47cd0f0fae62958521a0b18e5cd9bf3/tools/ci/flink-ci-tools/src/main/java/org/apache/flink/tools/ci/licensecheck/JarFileChecker.java#L194-L260
>>> > > [2] https://github.com/apache/opendal/blob/c35da0d92442756d5742eaf70a2259dd23621b53/deny.toml#L28-L48
>>> > >
>>> > > Best,
>>> > > tison.
>>> > >
>>> > > <[email protected]> wrote on Sat, Nov 22, 2025 at 21:44:
>>> > > >
>>> > > > Hi,
>>> > > >
>>> > > > One extra point that is worth mentioning. On several occasions,
>>> > > > I’ve seen automation give a false sense of security. A tool
>>> > > > reports everything as clean, and people assume the release is
>>> > > > fine when it is not. It’s only when humans look deeper that a
>>> > > > serious issue is discovered. For example, a mention of a GPL
>>> > > > license can be fine, depending on the context, and automation is
>>> > > > unlikely to detect it.
>>> > > >
>>> > > > Kind Regards.
>>> > > >
>>> > > > Justin
>>> > >
>>> > > ---------------------------------------------------------------------
>>> > > To unsubscribe, e-mail: [email protected]
>>> > > For additional commands, e-mail: [email protected]
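
For the check raised earlier in this thread (can every file in the source tarball be derived from the source tag?), here is a rough sketch that lists files present in the tarball but not tracked at a given git ref. It assumes the tarball wraps everything in a single top-level directory and that `git` is on the PATH. The function names are hypothetical and this is not an existing ASF tool. It compares file names only, so legitimately generated files will also show up and still need a human to look at them.

```python
import subprocess
import tarfile
from pathlib import PurePosixPath


def tarball_files(tar_path, strip_components=1):
    """Return regular-file paths in the tarball, minus the leading directories."""
    names = set()
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # Drop the assumed top-level directory, e.g. "proj-1.0/".
            parts = PurePosixPath(member.name).parts[strip_components:]
            if parts:
                names.add("/".join(parts))
    return names


def tracked_files(repo_dir, ref):
    """Return the paths git tracks at the given tag or commit."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "ls-tree", "-r", "--name-only", "-z", ref],
        check=True, capture_output=True, text=True,
    ).stdout
    # -z gives NUL-separated, unquoted paths.
    return {name for name in out.split("\0") if name}


def unexpected_files(in_tarball, in_git):
    """Files shipped in the tarball that are not tracked at the ref."""
    return sorted(in_tarball - in_git)
```

Anything returned by `unexpected_files(tarball_files(...), tracked_files(...))` is a candidate for the "files left over from testing" case and is worth a manual look before voting.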
