Re: Package statistics by downloads
On 2025-05-03 03:35, Otto Kekäläinen wrote:
>> I'm interested in package popularity. I'm aware of popcon
>> (https://popcon.debian.org/), but I'm more interested in actual
>> downloads.
>
> I am also interested in usage statistics. I feel it is much more
> meaningful to work on packages that I know have a lot of users.
>
> While neither popcon nor download stats are accurate, they still show
> trends and relative numbers which can be used to make useful
> conclusions. I would be glad to see if people could share ideas on
> what stats we could collect and publish instead of just pointing out
> flaws in various stats.

The problem is that we currently do not want to retain this data. It'd
require a clear measure of usefulness, not just a "it would be nice if
we had it". And there would need to be actual criteria of what we would
be interested in. Raw download count? Some measure of bucketing by
source IP or not? What about container/hermetic builders fetching the
same ancient package over and over again from snapshot? Does the version
matter?

In the end there would probably need to be a proof of concept of a log
processor that's privacy-friendly and gives us the metrics that we
actually want. Hence my question what these metrics are for, except for
a fuzzy feeling of "working on the right priorities". There will be lots
of packages that are rarely downloaded and still important.

Everyone can ask "please just retain all logs and we will do analysis on
them later". Right now it'd be infeasible to get the statistics from the
mirrors, and we could at most get statistics for deb.d.o. To give a
sense of scale: We are sampling 1% of cache hits and all errors right
now. That's 6.7 GB/d uncompressed (500 MB/d compressed). Back of the
envelope math says that'd be 600 GB/d of raw syslog log traffic. We
should have a very good reason for collecting this much data.

Kind regards
Philipp Kern
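The scale estimate in this message can be roughly reproduced in Python. This is only a sketch: the message does not say how the 6.7 GB/d splits between sampled cache hits and errors (which are already logged in full), so `hit_share` is a hypothetical parameter, and "500 M/d" is read here as 500 MB/d.

```python
SAMPLE_RATE = 0.01           # 1% of cache hits are logged (from the message)
SAMPLED_GB_PER_DAY = 6.7     # uncompressed volume today (from the message)
COMPRESSED_GB_PER_DAY = 0.5  # assumes "500 M/d" means 500 MB/d


def full_log_volume_gb(hit_share: float = 1.0) -> float:
    """Estimate the daily volume if every cache hit were logged.

    hit_share is a hypothetical parameter: the fraction of today's
    volume that consists of sampled cache hits. The remainder is
    treated as errors, which are already logged in full and therefore
    do not scale up.
    """
    hits = SAMPLED_GB_PER_DAY * hit_share
    errors = SAMPLED_GB_PER_DAY - hits
    return hits / SAMPLE_RATE + errors


# Ratio between today's uncompressed and compressed volumes.
compression_ratio = SAMPLED_GB_PER_DAY / COMPRESSED_GB_PER_DAY

if __name__ == "__main__":
    print(f"upper bound: {full_log_volume_gb(1.0):.0f} GB/d uncompressed")
    print(f"compression ratio today: {compression_ratio:.1f}x")
```

With `hit_share = 1.0` the result is an upper bound of about 670 GB/d, the same order of magnitude as the 600 GB/d quoted above; a smaller share of cache hits in today's volume lowers the estimate.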
Re: Package statistics by downloads
I suspect that compliance with GDPR would require the data to be
stored minimally.
It seems reasonable to me that a 24-hour window would remove most
repeat downloads.
If you stream the request log and reduce it to (ip, package, version),
it will be minimal.
I think it would fit into memory, e.g. 10 million unique IP addresses
x 100 packages x 40 bytes = 40 GB.

The program code could ideally be generalized and used by other
distros as well.

On Sat, May 3, 2025 at 10:43 AM Philipp Kern wrote:
>
> On 2025-05-03 03:35, Otto Kekäläinen wrote:
> >> I'm interested in package popularity. I'm aware of popcon
> >> (https://popcon.debian.org/), but I'm more interested in actual
> >> downloads.
> >
> > I am also interested in usage statistics. I feel it is much more
> > meaningful to work on packages that I know have a lot of users.
> >
> > While neither popcon nor download stats are accurate, they still show
> > trends and relative numbers which can be used to make useful
> > conclusions. I would be glad to see if people could share ideas on
> > what stats we could collect and publish instead of just pointing out
> > flaws in various stats.
>
> The problem is that we currently do not want to retain this data. It'd
> require a clear measure of usefulness, not just a "it would be nice if
> we had it". And there would need to be actual criteria of what we would
> be interested in. Raw download count? Some measure of bucketing by
> source IP or not? What about container/hermetic builders fetching the
> same ancient package over and over again from snapshot? Does the version
> matter?
>
> In the end there would probably need to be a proof of concept of a log
> processor that's privacy-friendly and gives us the metrics that we
> actually want. Hence my question what these metrics are for, except for
> a fuzzy feeling of "working on the right priorities". There will be lots
> of packages that are rarely downloaded and still important.
>
> Everyone can ask "please just retain all logs and we will do analysis on
> them later". Right now it'd be infeasible to get the statistics from the
> mirrors, and we could at most get statistics for deb.d.o. To give a
> sense of scale: We are sampling 1% of cache hits and all errors right
> now. That's 6.7 GB/d uncompressed (500 MB/d compressed). Back of the
> envelope math says that'd be 600 GB/d of raw syslog log traffic. We
> should have a very good reason for collecting this much data.
>
> Kind regards
> Philipp Kern
>
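The streaming reduction to (ip, package, version) proposed in this message could look roughly like the Python sketch below. It is not an existing tool: the `Request` record shape, the function names and the choice of a tumbling (non-overlapping) 24-hour window are all assumptions made for illustration, and the input is assumed to arrive in timestamp order.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator, NamedTuple


class Request(NamedTuple):
    """One parsed log record; this shape is hypothetical."""
    timestamp: datetime
    ip: str
    package: str
    version: str


def unique_downloads(requests: Iterable[Request],
                     window: timedelta = timedelta(hours=24),
                     ) -> Iterator[Request]:
    """Yield the first request per (ip, package, version) in each window.

    Assumes requests arrive in timestamp order. The set of seen tuples
    is cleared whenever a window ends, so no IP address is held in
    memory for longer than one window.
    """
    seen = set()
    window_start = None
    for req in requests:
        if window_start is None or req.timestamp - window_start >= window:
            seen.clear()
            window_start = req.timestamp
        key = (req.ip, req.package, req.version)
        if key not in seen:
            seen.add(key)
            yield req


def count_by_package(requests: Iterable[Request]) -> Counter:
    """Count deduplicated downloads per package name."""
    return Counter(req.package for req in unique_downloads(requests))


if __name__ == "__main__":
    t0 = datetime(2025, 5, 3, tzinfo=timezone.utc)
    demo = [
        Request(t0, "2001:db8::1", "bash", "5.2"),
        Request(t0 + timedelta(hours=1), "2001:db8::1", "bash", "5.2"),
        Request(t0 + timedelta(hours=25), "2001:db8::1", "bash", "5.2"),
    ]
    print(count_by_package(demo))
```

Only the per-package counters would need to be kept after a window closes; the IP addresses exist solely in the in-memory set. Whether that satisfies the privacy requirements raised earlier in the thread is a separate question that this sketch does not answer.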
Re: Package statistics by downloads
On 03/05/2025 02:35, Otto Kekäläinen wrote:
> I am also interested in usage statistics. I feel it is much more
> meaningful to work on packages that I know have a lot of users.

+1

> While neither popcon nor download stats are accurate, they still show
> trends and relative numbers which can be used to make useful
> conclusions. I would be glad to see if people could share ideas on
> what stats we could collect and publish instead of just pointing out
> flaws in various stats.

I was disappointed when Ubuntu stopped publishing popcon data.

My understanding is that popcon is set up to report data to an address
that is distro dependent. Do any of our downstreams actually harvest
this info? Maybe instead the downstream data could come to Debian with
the distro as an attribute?

Without factoring in the downstream data, desktop package usage overall
is likely to be understated.

Cheers,
Peter
Re: Package statistics by downloads
On Sat, 2025-05-03 at 11:16 +0200, Erik Schulz wrote:
> I suspect that compliance with GDPR would require the data to be
> stored minimally.
> It seems reasonable to me that a 24-hour window would reduce most
> repeat-downloads.
> If you stream the request log and reduce to (ip,package,version), it
> will be minimal.
> I think it would fit into memory, e.g. 10 million unique IP addresses
> x 100 packages x 40 bytes = 40 GB

Where has 100 packages come from here? There are 34 *thousand* source
packages in bookworm, i.e. over 100 times your quoted estimate.

You also seem to have underestimated quite a bit if you believe that
you can fit an IPv6 address, a package name and a package version into
40 bytes in most cases, let alone all.

(As an aside, the RAM allocation on the logging hosts is currently
2GB.)

Regards,

Adam
Re: Package statistics by downloads
Memory usage approximations, per tuple:

  ipv6            = 16 B
  package pointer =  3 B (assuming < 16777216 packages)
  version pointer =  2 B (assuming < 65536 distinct version names)
  + some overhead
  => ~40 B seems fair?

But you could also just write to disk. It'll wear out an SSD though,
and random r/w on a hard drive is slow.

> Where has 100 packages come from here?

That would be the average number of downloaded packages per IP per day.
I assume some would just download a single package, while others are
installing an entire system of 1000+ packages, but 100 on average seems
a fair ballpark number. What number do you suggest?

On Sat, May 3, 2025 at 11:39 AM Adam D. Barratt wrote:
>
> On Sat, 2025-05-03 at 11:16 +0200, Erik Schulz wrote:
> > I suspect that compliance with GDPR would require the data to be
> > stored minimally.
> > It seems reasonable to me that a 24-hour window would reduce most
> > repeat-downloads.
> > If you stream the request log and reduce to (ip,package,version), it
> > will be minimal.
> > I think it would fit into memory, e.g. 10 million unique IP addresses
> > x 100 packages x 40 bytes = 40 GB
>
> Where has 100 packages come from here? There are 34 *thousand* source
> packages in bookworm, i.e. over 100 times your quoted estimate.
>
> You also seem to have underestimated quite a bit if you believe that
> you can fit an IPv6 address, a package name and a package version into
> 40 bytes in most cases, let alone all.
>
> (As an aside, the RAM allocation on the logging hosts is currently
> 2GB.)
>
> Regards,
>
> Adam
>
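The arithmetic behind the 40 GB figure can be checked with a short Python sketch. All inputs are the thread's own estimates; the constant names and the use of decimal gigabytes (10**9 bytes) are assumptions made here.

```python
IPV6_BYTES = 16            # raw IPv6 address
PACKAGE_INDEX_BYTES = 3    # index into a table of < 2**24 package names
VERSION_INDEX_BYTES = 2    # index into a table of < 2**16 version names
TUPLE_BYTES_ESTIMATE = 40  # payload plus the "some overhead" allowance

UNIQUE_IPS = 10_000_000    # estimate from the thread
PACKAGES_PER_IP = 100      # estimated daily average per IP
LOGGING_HOST_RAM_GB = 2    # RAM figure quoted for the logging hosts

# Bytes of actual payload per tuple, before any container overhead.
payload_bytes = IPV6_BYTES + PACKAGE_INDEX_BYTES + VERSION_INDEX_BYTES

# Total memory for one day's worth of tuples, in decimal gigabytes.
total_bytes = UNIQUE_IPS * PACKAGES_PER_IP * TUPLE_BYTES_ESTIMATE
total_gb = total_bytes / 10**9


def max_tuples_in_ram(ram_gb: int,
                      bytes_per_tuple: int = TUPLE_BYTES_ESTIMATE) -> int:
    """Number of tuples that fit into ram_gb decimal gigabytes."""
    return ram_gb * 10**9 // bytes_per_tuple


if __name__ == "__main__":
    print(f"payload per tuple: {payload_bytes} B")
    print(f"estimated total:   {total_gb:.0f} GB")
    print(f"tuples in {LOGGING_HOST_RAM_GB} GB: "
          f"{max_tuples_in_ram(LOGGING_HOST_RAM_GB):,}")
```

Under these estimates the payload is 21 B per tuple, so 40 B leaves about 19 B for overhead, and the daily total is 20 times the 2 GB of RAM mentioned for the logging hosts. The package name and version tables themselves are not included in this figure.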
Bug#1008532: ITP: fcitx5-mcbopomofo -- McBopomofo input method for fcitx5
Package: wnpp
Followup-For: Bug #1008532
X-Debbugs-Cc: debian-devel@lists.debian.org, ajq...@debian.org

Dear ChangZhuo,

I saw you have already packaged this at
https://salsa.debian.org/input-method-team/fcitx5-mcbopomofo/

Did you forget to upload it?

Best regards,
-Andrew
Adding Pre-Depends from linux-image packages to linux-base
Hi all,

I'm proposing to add a linux-run-hooks command to the linux-base
package [1] that will then be used in all maintainer scripts of
linux-image packages [2]. This requires upgrading the current Depends
on linux-base to Pre-Depends.

This message is to start the discussion required by policy for a new
use of Pre-Depends.

Both packages are under kernel team maintenance, and linux-base has
minimal dependencies (debconf | debconf-2.0). So I don't anticipate
this causing any problems with upgrades.

Ben.

[1] https://salsa.debian.org/kernel-team/linux-base/-/merge_requests/14
[2] https://salsa.debian.org/kernel-team/linux/-/merge_requests/1493

--
Ben Hutchings
It is impossible to make anything foolproof
because fools are so ingenious.