Re: Package statistics by downloads

2025-05-03 Thread Philipp Kern

On 2025-05-03 03:35, Otto Kekäläinen wrote:

>> I'm interested in package popularity. I'm aware of popcon
>> (https://popcon.debian.org/), but I'm more interested in actual
>> downloads.
>
> I am also interested in usage statistics. I feel it is much more
> meaningful to work on packages that I know have a lot of users.
>
> While neither popcon nor download stats are accurate, they still show
> trends and relative numbers which can be used to make useful
> conclusions. I would be glad to see if people could share ideas on
> what stats we could collect and publish instead of just pointing out
> flaws in various stats.


The problem is that we currently do not want to retain this data. It'd 
require a clear measure of usefulness, not just a "it would be nice if 
we had it". And there would need to be actual criteria of what we would 
be interested in. Raw download count? Some measure of bucketing by 
source IP or not? What about container/hermetic builders fetching the 
same ancient package over and over again from snapshot? Does the version 
matter?


In the end there would probably need to be a proof of concept of a log 
processor that's privacy-friendly and gives us the metrics that we 
actually want. Hence my question what these metrics are for, except for 
a fuzzy feeling of "working on the right priorities". There will be lots 
of packages that are rarely downloaded and still important.
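
A minimal sketch of what such a proof of concept could look like, under
stated assumptions: the input format (one request per line, "IP PATH") and
the function names are hypothetical and not taken from any existing Debian
infrastructure. Each client IP is replaced by a keyed hash whose salt lives
only in memory, repeat downloads by the same client are collapsed, and only
per-package counts are kept.

```python
import hashlib
import os
from collections import Counter


def parse_deb_path(path):
    """Split e.g. pool/main/h/hello/hello_2.10-3_amd64.deb into
    (package, version). Returns None for anything that is not a .deb."""
    filename = path.rsplit("/", 1)[-1]
    if not filename.endswith(".deb"):
        return None
    parts = filename[: -len(".deb")].split("_")
    if len(parts) != 3:
        return None
    package, version, _arch = parts
    return package, version


def count_unique_downloads(lines, salt=None):
    """Count distinct clients per package from 'IP PATH' log lines.

    The log format is an assumption made for this sketch. The salt is
    random per run and never stored, so hashes cannot be linked across runs.
    """
    salt = salt if salt is not None else os.urandom(16)
    seen = set()
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        ip, path = fields
        parsed = parse_deb_path(path)
        if parsed is None:
            continue
        package, _version = parsed
        # Keyed hash instead of the raw address; 8 bytes keeps the set small.
        client = hashlib.blake2b(ip.encode(), key=salt, digest_size=8).digest()
        if (client, package) not in seen:
            seen.add((client, package))
            counts[package] += 1
    return counts


sample = [
    "192.0.2.1 pool/main/h/hello/hello_2.10-3_amd64.deb",
    "192.0.2.1 pool/main/h/hello/hello_2.10-3_amd64.deb",  # repeat, ignored
    "2001:db8::1 pool/main/h/hello/hello_2.10-3_amd64.deb",
    "192.0.2.1 dists/bookworm/Release",                    # not a .deb
]
print(count_unique_downloads(sample))
```

Whether to count per package, per version, or per source package is exactly
the open question about metrics; the sketch picks the binary package name
only to have something concrete.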


Everyone can ask "please just retain all logs and we will do analysis on 
them later". Right now it'd be infeasible to get the statistics from the 
mirrors, and we could at most get statistics for deb.d.o. To give a 
sense of scale: We are sampling 1% of cache hits and all errors right
now. That's 6.7 GB/d uncompressed (500 MB/d compressed). Scaling the 1%
sample up to all requests, back-of-the-envelope math says that'd be
about 600 GB/d of raw syslog traffic. We should have a very good reason
for collecting this much data.
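
The back-of-the-envelope figure can be reproduced roughly as follows. This is
only a sketch: it treats the whole 6.7 GB/d as a 1% sample, although the
sample also contains all errors unsampled, so the result is an upper bound
rather than the exact number quoted above.

```python
SAMPLED_GB_PER_DAY = 6.7  # from the message: uncompressed volume of the sample
SAMPLE_RATE = 0.01        # from the message: 1% of cache hits


def full_volume_gb_per_day(sampled_gb, sample_rate):
    """Scale a sampled log volume up to the unsampled volume."""
    return sampled_gb / sample_rate


estimate = full_volume_gb_per_day(SAMPLED_GB_PER_DAY, SAMPLE_RATE)
print(f"upper bound: {estimate:.0f} GB/d")
print(f"per year:    {estimate * 365 / 1000:.0f} TB")
```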


Kind regards
Philipp Kern



Re: Package statistics by downloads

2025-05-03 Thread Erik Schulz
I suspect that compliance with GDPR would require the data to be
stored minimally.
It seems reasonable to me that a 24-hour window would reduce most
repeat-downloads.
If you stream the request log and reduce to (ip,package,version), it
will be minimal.
I think it would fit into memory, e.g. 10 million unique IP addresses x
100 packages x 40 bytes = 40 GB
The program code could ideally be generalized and used by other distros as well.
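
Spelled out as a small calculation (the three inputs are the guesses from
the paragraph above, not measured values, and the function name is made up
for this sketch):

```python
def tuple_store_bytes(unique_ips, packages_per_ip, bytes_per_tuple):
    """Memory needed to hold one (ip, package, version) tuple per download."""
    return unique_ips * packages_per_ip * bytes_per_tuple


total = tuple_store_bytes(
    unique_ips=10_000_000,   # guessed unique client addresses per day
    packages_per_ip=100,     # guessed average downloads per address per day
    bytes_per_tuple=40,      # guessed size of one stored tuple
)
print(total / 10**9, "GB")  # 40.0 GB
```

The result scales linearly with each guess, so doubling any one of them
doubles the memory requirement.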




Re: Package statistics by downloads

2025-05-03 Thread Peter B

On 03/05/2025 02:35, Otto Kekäläinen wrote:

> I am also interested in usage statistics. I feel it is much more
> meaningful to work on packages that I know have a lot of users.

+1


> While neither popcon nor download stats are accurate, they still show
> trends and relative numbers which can be used to make useful
> conclusions. I would be glad to see if people could share ideas on
> what stats we could collect and publish instead of just pointing out
> flaws in various stats.


I was disappointed when Ubuntu dropped publishing popcon data.

My understanding is that popcon is set up to report data to an address
that is distro dependent. Do any of our downstreams actually harvest 
this info?


Maybe instead the downstream data could come to Debian with the distro
as an attribute?  Without factoring in the downstream data, desktop package
usage overall is likely to be understated.


Cheers,
Peter



Re: Package statistics by downloads

2025-05-03 Thread Adam D. Barratt
On Sat, 2025-05-03 at 11:16 +0200, Erik Schulz wrote:
> I suspect that compliance with GDPR would require the data to be
> stored minimally.
> It seems reasonable to me that a 24-hour window would reduce most
> repeat-downloads.
> If you stream the request log and reduce to (ip,package,version), it
> will be minimal.
> I think it would fit into memory, e.g. 10 million unique IP addresses
> x 100 packages x 40 bytes = 40 GB

Where has 100 packages come from here? There are 34 *thousand* source
packages in bookworm, i.e. over 100 times your quoted estimate.

You also seem to have underestimated quite a bit if you believe that
you can fit an IPv6 address, a package name and a package version into
40 bytes in most cases, let alone all.

(As an aside, the RAM allocation on the logging hosts is currently
2GB.)
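
To put rough numbers on both points, here is a small sketch. The package
name and version string are illustrative examples only, and the tuple count
reuses the quoted estimate of 10 million addresses x 100 packages.

```python
import ipaddress

# Illustrative values only; any real log would contain many different ones.
ip = ipaddress.ip_address("2001:db8::1")
package = "linux-image-6.1.0-18-amd64"
version = "6.1.76-1"

# Size of one tuple if the name and version are stored as plain strings.
raw_bytes = len(ip.packed) + len(package.encode()) + len(version.encode())

ram_bytes = 2 * 10**9                 # 2 GB logging host, from the message
budget_per_tuple = 40                 # bytes, from the quoted estimate
tuples_that_fit = ram_bytes // budget_per_tuple
tuples_estimated = 10_000_000 * 100   # from the quoted estimate

print("bytes for one raw tuple:", raw_bytes)
print("tuples that fit in RAM: ", tuples_that_fit)
print("tuples in the estimate: ", tuples_estimated)
```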

Regards,

Adam



Re: Package statistics by downloads

2025-05-03 Thread Erik Schulz
Memory usage approximations, per tuple:
ipv6 address = 16 bytes
package pointer = 3 bytes (assuming <16777216 packages)
version pointer = 2 bytes (assuming <65536 distinct version names)
+ some overhead
=> ~40 B seems fair?
But you could also just write to disk. It'll wear out an SSD though,
and random r/w on a hard drive is slow.

> Where has 100 packages come from here?
That would be the average number of downloaded packages per IP per
day. I assume some would just download a single package, while others
are installing an entire system of 1000+ packages, but 100 on average
seems a fair ballpark number.
What number do you suggest?
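
A rough sketch of the encoding described above, assuming package names and
version strings are interned into lookup tables and only their ids are
stored. The class and function names are hypothetical. IPv4 addresses are
stored in IPv4-mapped form so that every record has the same length.

```python
import ipaddress


class Interner:
    """Map strings to consecutive integer ids of a fixed byte width."""

    def __init__(self, id_bytes):
        self.id_bytes = id_bytes
        self.ids = {}

    def encode(self, value):
        if value not in self.ids:
            if len(self.ids) >= 256 ** self.id_bytes:
                raise OverflowError("id space exhausted")
            self.ids[value] = len(self.ids)
        return self.ids[value].to_bytes(self.id_bytes, "big")


packages = Interner(3)  # 3 bytes: fewer than 16777216 package names
versions = Interner(2)  # 2 bytes: fewer than 65536 distinct versions


def pack_tuple(ip, package, version):
    """Pack (ip, package, version) into a fixed 21-byte record."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        # IPv4-mapped IPv6 layout: 80 zero bits, 16 one bits, then the IPv4.
        packed_ip = b"\x00" * 10 + b"\xff\xff" + addr.packed
    else:
        packed_ip = addr.packed
    return packed_ip + packages.encode(package) + versions.encode(version)


record = pack_tuple("2001:db8::1", "hello", "2.10-3")
print(len(record))  # 21
```

The 21 bytes cover only the payload. The interning tables cost memory too,
and a generic in-memory container adds per-entry overhead on top, so the
"+ some overhead" term is where the estimate is most uncertain.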







Bug#1008532: ITP: fcitx5-mcbopomofo -- McBopomofo input method for fcitx5

2025-05-03 Thread 李健秋
Package: wnpp
Followup-For: Bug #1008532
X-Debbugs-Cc: debian-devel@lists.debian.org, ajq...@debian.org


Dear ChangZhuo,

I saw you have already packaged this on
https://salsa.debian.org/input-method-team/fcitx5-mcbopomofo/

Did you forget to upload it?

Best regards,

-Andrew



Adding Pre-Depends from linux-image packages to linux-base

2025-05-03 Thread Ben Hutchings
Hi all,

I'm proposing to add a linux-run-hooks command to the linux-base package
[1] that will then be used in all maintainer scripts of linux-image
packages [2].  This requires upgrading the current Depends on linux-base
to Pre-Depends.

This message is to start the discussion required by policy for a new use
of Pre-Depends.

Both packages are under kernel team maintenance, and linux-base has
minimal dependencies (debconf | debconf-2.0).  So I don't anticipate
this causing any problems with upgrades.

Ben.

[1] https://salsa.debian.org/kernel-team/linux-base/-/merge_requests/14
[2] https://salsa.debian.org/kernel-team/linux/-/merge_requests/1493

-- 
Ben Hutchings
It is impossible to make anything foolproof
because fools are so ingenious.

