Hi,

regarding your analysis: I think you could just scan the .changes files, as they list all *.deb files uploaded, though very old .changes files only have MD5 hashes.
They can be found in
file://mirror.ftp-master.debian.org/srv/ftp-master.debian.org/queue/done

As for your observation that bash does not show up in any Packages
index: that can happen for (at least) two reasons. The snapshot service
does not retrieve all Packages files. Or the package could have been
superseded by a newer version before it was ever published in a dinstall
run.

Regards,
Ansgar

On Wed, 2025-04-02 at 15:26 +0200, Johannes Schauer Marin Rodrigues wrote:
> Hi,
>
> On Thu, 30 May 2024 14:26:31 +0000 Holger Levsen <hol...@layer-acht.org> wrote:
> > very "nice" find, josch!
>
> with the help of Holger and osuosl4 I have dug into this a bit more and tried
> to get some hard data about this problem. My idea was the following: parse all
> Packages files for all suites, all architectures and all components for all
> timestamps stored on snapshot.d.o and find packages with the same
> name/arch/version tuple that have a different checksum. To this end, I slightly
> (less than 1000 lines of diff) patched the tooling at
> https://salsa.debian.org/metasnap-team/metasnap.git with the patch that I
> attached to this mail on top of 1dadf2575160caf9467c4e21aa6c0a31ac10ffc2.
>
> After running that script for 3 months and downloading 189 GB of data in 3.5
> Million requests (about 2 seconds for every request), we had a database
> (actually a git repository) of 48 GB that we can use to find duplicates. It
> took another 2 months to go through that data. I attached a graph which shows
> the number of duplicate name/arch/version triplets per timestamp. Please note
> the logarithmic y-axis. The total number of duplicates from 2005 until 2024 is
> 334335.
>
> Problem solved? Not so fast. Processing all Packages files will *not* find the
> original problem with bash. Why? Because according to the Packages files from
> snapshot.debian.org only one version of bash:arm64=5.2.15-2+b3 exists, namely:
>
> MD5sum: 01ee4cfa3df78e7ff0dc156ff19e2c88
> SHA1:  1a0b12419b69a983bf22ac1d3d9f8bec725487b1
> SHA256: 828ce0b4445921fff5b6394e74cce8296f3038d559845a3e82435b55ca6fcaa8
>
> The other version never ended up in a Packages file even though it was found in
> the /pool/main/b/bash directory in the snapshot of 2023-07-13 21:11:09 nearly
> one year before the other version popped up.
>
> How can a package be in the pool directory but not in a Packages file? No idea
> but it shows that my method from above does not find a certain class of
> problems. We could find those by creating a fitting query against the
> snapshot.d.o database. Apparently lw07 is DD accessible and has a
> snapshot-guest service. So this is on my TODO list and Nicolas Dandrimont
> already offered to help with constructing an appropriate SQL query during
> MiniDebConf Hamburg this year.
>
> Lastly there is the problem of packages in incoming. Those packages will be
> used to build other packages that end up in the archive but they might never
> end up in the archive themselves. Thus, we might never know whether one of
> these packages violated the idea that the packagename/architecture/version
> triplet uniquely identifies a Debian binary package in the archive...
>
> Thanks!
>
> cheers, josch
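
For readers following along, the duplicate check described in the quoted mail (same name/arch/version triplet, different checksum) can be sketched roughly as below. This is only a minimal illustration and not the metasnap patch: the function names `parse_stanzas` and `find_duplicates` are hypothetical, the parser is deliberately simplified, and it assumes every stanza carries a single-line `SHA256` field, which does not hold for very old indices that only have MD5 sums.

```python
from collections import defaultdict


def parse_stanzas(text):
    """Yield one dict per stanza of a Packages-style index.

    Simplified on purpose: continuation lines (those starting with
    whitespace) are skipped, because the fields used here are single-line.
    """
    stanza = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current stanza.
            if stanza:
                yield stanza
                stanza = {}
        elif not line[0].isspace() and ":" in line:
            key, _, value = line.partition(":")
            stanza[key] = value.strip()
    if stanza:
        yield stanza


def find_duplicates(indices):
    """Return {(name, arch, version): set of SHA256 values} for every
    triplet that was seen with more than one distinct checksum.

    `indices` is an iterable of Packages file contents (strings), for
    example one per suite/component/architecture/timestamp.
    """
    seen = defaultdict(set)
    for text in indices:
        for stanza in parse_stanzas(text):
            fields = [stanza.get(k) for k in
                      ("Package", "Architecture", "Version", "SHA256")]
            if None in fields:
                # Assumption of this sketch: skip incomplete stanzas.
                continue
            name, arch, version, sha256 = fields
            seen[(name, arch, version)].add(sha256)
    return {key: sums for key, sums in seen.items() if len(sums) > 1}
```

As the quoted mail points out, a check of this kind only sees what was published in a Packages file, so a .deb that only ever existed in the pool directory or in incoming would not be caught by it.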