Hi Simon,

On Thu, Apr 17, 2025 at 08:23:18PM +0200, Simon Josefsson wrote:
> I noticed that Fedora 42 was released and their docker images lack a
> 'awk' tool.  Debian trixie images ship with 'mawk' pre-installed right
> now.  While I'm not convinced the removal game is necessarily a good
> one, I can see that it does have some advantages.  Is it possible to
> drop 'mawk' from the set of default tools in trixie?  If not, what are
> the blockers?  What is the method to find out what the blockers are?

Shrinking essential/minbase/container images is generally a worthwhile
goal, as you saw from the existing replies. What is less useful is
asking "can we drop XXX?" with little context, because (as others
indicated) answering it is a ton of work.  The way to advance these
matters is to do the research.

One of the first aspects is what "dropping" means. Typical answers:
 * Removing "Essential: yes"
   * e2fsprogs, mount and a few more used to be essential.
 * Removing dependencies
   * apt (not essential, but close) used to depend on adduser.
 * Reducing the Priority value
   * We've been debating this for ifupdown.
 * Removing dependencies within the build-essential set
   * I recently proposed removing libcrypt-dev from build-essential.

In this case, the immediate meaning must be getting it out of essential.
However, that alone does not remove it from container images. Doing so
incurs further work and also raises the user impact (see Sean's mail).

Next, there is the question of what we gain. Essential weighs in at
roughly 100MB (depending on how you count it). So regarding awk, we're
talking about a size reduction of about 0.3%. For comparison, being able
to substitute toybox for coreutils has the potential to save more than
10% of the size. Removing bash (keeping dash) would save around 7%.
Whilst those other gains are significantly higher, so are their impact
and effort. Picking a sensible candidate is the difficult part here.
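
To put those percentages into absolute terms, here is a back-of-the-
envelope sketch. It only uses the rough figures above; the 100MB total
and the shares are approximations, and the function name is made up for
illustration.

```python
# Rough figures from the paragraph above; all of them are approximations.
ESSENTIAL_MB = 100.0

# Approximate share of the essential set that each change might save.
savings_share = {
    "drop awk": 0.003,
    "drop bash, keep dash": 0.07,
    "coreutils -> toybox": 0.10,  # "more than 10%", so a lower bound
}


def absolute_saving_mb(share, total_mb=ESSENTIAL_MB):
    """Convert a fractional share of the essential set into megabytes."""
    return share * total_mb


for change, share in savings_share.items():
    print(f"{change}: ~{absolute_saving_mb(share):.1f} MB")
```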

This leads us to analyzing the effort and impact. Being in the essential
set means that dependencies on awk are not spelled out. So the first
step is locating those dependencies. As we will likely not be able to
audit Debian's source code for awk uses in a reasonable amount of time,
empirical methods are likely needed.
 * Rebuild the archive with awk dropped and see what fails
 * Consider using reproducible builds to additionally see what packages
   change as a result of dropping awk (for those that happen to be 
   reproducible)
 * Search for awk usage in maintainer scripts
   https://binarycontrol.debian.net/?q=awk&path=unstable%2F.*%2Fp
   Note that postrm scripts cannot express dependencies and need to be
   rewritten without awk. It also means that, even if you assume that
   people always purge their packages, we may remove awk in forky+1 at
   the earliest, provided we manage to fix all postrm scripts in forky.
 * Download all Debian binary packages and search for awk uses in the 
   installed files using regular expressions.
 * Run autopkgtests with awk removed
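
The regular expression searches from the list above could start out
like the sketch below. It assumes the packages have already been
unpacked into a directory tree (for instance with dpkg-deb -x for the
installed files and dpkg-deb -e for the maintainer scripts). The
pattern and the function name are only illustrative. It is a crude
heuristic that will report false positives (comments, documentation)
and miss indirect uses.

```python
import os
import re

# Heuristic (an assumption, not an established rule): match awk, mawk,
# gawk or original-awk as a standalone word, so that "hawk" or "awkward"
# do not count.
AWK_RE = re.compile(rb"(?<![\w-])(?:[mg]?awk|original-awk)(?![\w-])")


def find_awk_uses(root):
    """Yield (path, line_number, line) for lines under root mentioning awk."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only scan regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if AWK_RE.search(line):
                            text = line.rstrip(b"\n").decode("utf-8", "replace")
                            yield path, lineno, text
            except OSError:
                # Unreadable files are ignored in this sketch.
                continue
```

Calling list(find_awk_uses("unpacked/")) on such a tree gives a list of
candidates to review by hand. It is not a list of confirmed
dependencies.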

Doing this is a ton of work. Doing that work and presenting the results 
is what makes "can we drop awk?" a useful question as it answers the 
cost part.

This is not meant to discourage you. Quite the contrary. Reducing
implicit software dependencies has lots of other benefits such as easing
architecture bootstrapping and a smaller trusted computing base. It is
not a topic you can tackle in a spare evening, though.

For instance, I'd like to propose making coreutils substitutable in
essential like awk is substitutable. However, that question is not
presently "useful" in the sense that it lacks a sound implementation.
I pondered this with Jochen and Johannes back in Würzburg, and now
Julian has picked up the question and arrived at a promising prototype
based on feedback from Guillem. I hope that we will be discussing
coreutils soon, but that discussion will be so much more useful when it
comes with a prototype and an impact analysis.

Helmut
