Hi,

I would be interested in per-package-and-version download statistics and trends as well.

On 2025-05-03 09:28, Philipp Kern wrote:

The problem is that we currently do not want to retain this data.

You're absolutely right here: there is no point in retaining the raw data, as it gets stale pretty fast anyway. It has to be processed with minimal delay and then fed into some kind of time-series database.

It'd require a clear measure of usefulness, not just a "it would be nice if we had it". And there would need to be actual criteria of what we would be interested in. Raw download count? Some measure of bucketing by source IP or not? What about container/hermetic builders fetching the same ancient package over and over again from snapshot? Does the version matter?

It would help (as an additional data input) when making decisions about keeping or removing packages, especially those with very low popcon scores. I would also expect download counts to be particularly significant (for estimating the installed base) right after a package update is released.

Having the count of total successful downloads and the count of unique IPs for a given package+version couple (= URI) within a given time interval would be a good start. Further refinements could be implemented later, like segregating counts by geographical area or by consumer/corporate address range. With this scheme there are no privacy issues, as IP addresses are not retained at all in the TSDB (not even pseudonymized/anonymized). Time resolution could be hourly to start with, and maybe later down to the minute for recent history, depending on the required processing power and storage.
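To make this more concrete, below is a minimal sketch of what one hourly aggregation pass could look like. Everything in it is an assumption on my part: the log format, the field order, and printing the aggregates instead of writing them to a TSDB. The point is only that client IPs live in memory for the duration of the interval and never leave the process; if the per-URI IP sets get too large, a probabilistic counter such as HyperLogLog would bound memory at the cost of a small counting error.

#!/usr/bin/env python3
# Sketch: aggregate one interval's worth of mirror access log lines
# into per-URI download and unique-IP counts. Log format and field
# positions are assumed; adapt to whatever the cache provider emits.

import sys
from collections import defaultdict

downloads = defaultdict(int)    # URI -> successful download count
unique_ips = defaultdict(set)   # URI -> client IPs, discarded after the interval

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 3:
        continue
    ip, status, uri = fields[0], fields[1], fields[2]  # assumed field order
    if status != "200":
        continue
    downloads[uri] += 1
    unique_ips[uri].add(ip)

# Only the aggregates leave this process; the IP sets are never written out.
for uri, count in sorted(downloads.items()):
    print(f"{uri}\t{count}\t{len(unique_ips[uri])}")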

There will be lots of packages that are rarely downloaded and still important.

Indeed. That's just additional data to help make decisions in cases where we have doubts.

Back of the envelope math says that'd be 600 GB/d of raw syslog log traffic.

I don't think that regular syslog is a reasonable way to retrieve that amount of data from distant hosts. I don't know what the options are with the current cache provider, but transferring already compressed data every hour (or at a shorter interval, or streaming compressed data) sounds better. That would amount to ~2 GiB of compressed data (~25 GiB uncompressed) every hour on average, which seems workable.
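For reference, the back-of-envelope behind those figures: 600 GB/day of raw logs is 600 / 24 = 25 GB per hour, and getting from ~25 GiB down to ~2 GiB assumes a compression ratio somewhere around 12:1, which seems plausible for highly repetitive access logs with zstd or xz, but would need to be checked against real data.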

Is there any way I could get a copy of a log file (one of the current ones with 1% sampling) to experiment with?

Cheers,

--
Julien Plissonneau Duquène
