Hi,

I would be interested in per-package-and-version download statistics and trends as well.

On 2025-05-03 09:28, Philipp Kern wrote:

The problem is that we currently do not want to retain this data.

You're absolutely right here: there is no point in retaining the raw data, as it gets stale pretty fast anyway. It has to be processed with minimal delay and then fed into some kind of time-series database.

It'd require a clear measure of usefulness, not just a "it would be nice if we had it". And there would need to be actual criteria of what we would be interested in. Raw download count? Some measure of bucketing by source IP or not? What about container/hermetic builders fetching the same ancient package over and over again from snapshot? Does the version matter?

It would help (as an additional data input) when making decisions about keeping or removing packages, especially those with very low popcon scores. I would also expect download counts to be particularly significant (for estimating the installed base) right after a package update is released.

Having the count of total successful downloads and the count of unique IPs for a given package+version couple (= URI) within a given time interval would be a good start. Further refinements could be implemented later, like segregating counts by geographical area or by consumer/corporate address range. With this scheme there are no privacy issues, as IP addresses are not retained at all in the TSDB (not even pseudonymized/anonymized). Time resolution could be hourly to start with, and maybe later down to the minute for recent history, depending on the required processing power and storage.
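To make this more concrete, below is a minimal sketch of what one hourly aggregation pass could look like. Everything in it is an assumption on my part: the log format, the field order, and printing the aggregates instead of writing them to a TSDB. The point is only that client IPs live in memory for the duration of the interval and never leave the process; if the per-URI IP sets get too large, a probabilistic counter such as HyperLogLog would bound memory at the cost of a small counting error.

#!/usr/bin/env python3
# Sketch: aggregate one interval's worth of mirror access log lines
# into per-URI download and unique-IP counts. Log format and field
# positions are assumed; adapt to whatever the cache provider emits.

import sys
from collections import defaultdict

downloads = defaultdict(int)    # URI -> successful download count
unique_ips = defaultdict(set)   # URI -> client IPs, discarded after the interval

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 3:
        continue
    ip, status, uri = fields[0], fields[1], fields[2]  # assumed field order
    if status != "200":
        continue
    downloads[uri] += 1
    unique_ips[uri].add(ip)

# Only the aggregates leave this process; the IP sets are never written out.
for uri, count in sorted(downloads.items()):
    print(f"{uri}\t{count}\t{len(unique_ips[uri])}")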

There will be lots of packages that are rarely downloaded and still important.

Indeed. That's just additional data to help make decisions in cases where we have doubts.

Back of the envelope math says that'd be 600 GB/d of raw syslog log traffic.

I don't think that regular syslog is a reasonable way to retrieve that amount of data from distant hosts. I don't know what the options are with the current cache provider, but transferring already compressed data every hour (or at a shorter interval, or streaming compressed data) sounds better. That would amount to ~2 GiB of compressed data (~25 GiB uncompressed) every hour on average, which seems workable.
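For reference, the back-of-envelope behind those figures: 600 GB/day of raw logs is 600 / 24 = 25 GB per hour, and getting from ~25 GiB down to ~2 GiB assumes a compression ratio somewhere around 12:1, which seems plausible for highly repetitive access logs with zstd or xz, but would need to be checked against real data.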

Is there any way I could get a copy of a log file (one of the current ones with 1% sampling) to experiment with?

Cheers,

--
Julien Plissonneau Duquène
