[ CCing Raphael, who showed interest off-list, and Jonathan for input
  on liblzma MT-support. ]

Hi!

On Tue, 2008-10-07 at 23:08:08 +0800, Paul Wise wrote:
> Package: dpkg
> Version: 1.14.22
> Severity: wishlist

> For those with systems with multiple CPUs, it would be nice if dpkg
> could support multi-threaded decompression and compression. Here are
> some implementations and possible hints to implementations of these:

On Thu, 2014-04-03 at 11:05:36 +0300, Riku Voipio wrote:
> The mentioned implementations are now all in debian:
> 
> pigz - Parallel Implementation of GZip
>  code lines 4304
> pixz - parallel, indexing XZ compressor/decompressor
>  code lines 2169
> pbzip2 - parallel bzip2 implementation
>  code lines 6119 (c++)
> 
> Unsurprisingly pixz is also the most readable threaded
> compressing/uncompressing implementation. From a quick glance
> I guess the liblzma api makes it easier than others.
> 
> Now that xz is a) default b) most cpu-heavy, perhaps it's also the only
> one worth parallelizing.

Ok, here go my thoughts. I agree it would be nice in principle. Ideally
I'd like to have parallel support natively in libdpkg (indirectly or
directly), to avoid external dependencies, to avoid exposing internal
implementation details, and because I've been pondering removing the
code to use the command-line tools (which is currently disabled in
Debian anyway). But that might imply using threads, and the dpkg codebase
is definitely not yet thread-safe. I guess it could be implemented with
multiple processes, at the cost of a possibly slightly more complicated
implementation.
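
To make the multi-process route a bit more concrete, here is a minimal
Python sketch (not dpkg code). The parent splits the payload and hands
each piece to a child process, so the parent itself needs no threads.
The names WORKER and compress_multiprocess, the chunk size and the
worker count are all made up for illustration. It assumes the consumer
accepts concatenated .xz streams, which Python's lzma.decompress does.

```python
import lzma
import subprocess
import sys

# Hypothetical worker: reads raw bytes on stdin and writes one complete
# .xz stream on stdout. It stands in for a separate compressor process.
WORKER = (
    "import sys, lzma; "
    "sys.stdout.buffer.write(lzma.compress(sys.stdin.buffer.read()))"
)


def compress_multiprocess(data, chunk_size=1 << 20, workers=4):
    """Compress data as several .xz streams, one child process per chunk."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    out = []
    # Run the children in batches of at most `workers` processes.
    for start in range(0, len(chunks), workers):
        procs = []
        for chunk in chunks[start:start + workers]:
            proc = subprocess.Popen(
                [sys.executable, "-c", WORKER],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
            )
            # The worker drains stdin until EOF before it writes anything,
            # so feeding it and closing the pipe cannot deadlock.
            proc.stdin.write(chunk)
            proc.stdin.close()
            procs.append(proc)
        # Collect the results in input order to keep the output stable.
        for proc in procs:
            out.append(proc.stdout.read())
            proc.stdout.close()
            if proc.wait() != 0:
                raise RuntimeError("compressor child failed")
    return b"".join(out)
```

The extra bookkeeping (pipes, batching, exit-status checks) is the kind
of added complexity the multi-process approach brings compared to a
single in-process library call.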

But meanwhile, as a workaround, I've also pondered making both the
compression library and the command-line tools available at the same time
so that the user could request one or the other, which would also allow
adding additional compressor command-line tools. I have not done
anything in that direction, though, because these pose several problems,
see below.

Some comments on external implementations:

 * liblzma does have a native multi-threaded encoder upstream but it's
   disabled in Debian, debian/patches/abi-threaded-encoder. This would
   make adding native support to dpkg pretty easy. But it seems the
   latest position from upstream is that this support could even
   disappear? Jonathan, any thoughts or further data on this?
 * pigz uses an internal compressing library (zopfli) that claims to
   compress better at the cost of being very slow, so I'm not sure it's
   worth using. It appears to use zlib for decompression. Would need
   more checking and benchmarks.
 * pixz seems nice, but I'm not sure about its file format compatibility
   when it emits the tarball index. Fortunately it can be disabled
   altogether with the -t option. Would need taking a look too. I'm not
   really prepared to even consider making dpkg Pre-Depend on pixz for
   example, because I don't think pulling libarchive as pseudo-essential
   would be acceptable, but otherwise it's pretty small.
 * pxz was not mentioned but I don't think it's really an option, as
   it uses temporary files… Its implementation uses OpenMP so it would
   really not be an option to lift into libdpkg either.
 * pbzip2 is mainly only useful for compression, as decompression
   requires the data to have been compressed with pbzip2 itself.
   bzip2 is deemed deprecated, so probably not worth it I'd say.

So, until this is natively implemented in libdpkg, I guess it might
make some sense to consider at least supporting pixz in some form
(eventual parallel gzip would be nice too, as the other non-deprecated
compressor supported). Tests would need to be performed on single and
multi-core setups, to see the speed/memory-usage/disk-size differences
with small and huge packages. Then there are several ways the user could
enable this in dpkg-deb, none of which are pretty:

 1) Through a new compressor name to dpkg-deb, say -Zpixz, which
    could fall back to using xz if pixz is not available.

    It would require versioned Build-Depends on the dpkg implementing
    that plus a Build-Depends on pixz in the package making use of
    it. This would be an ok option if there's no major difference in
    the implementations' behavior, so that anyone trying to build
    a huge package benefits, and so that no special setup is
    required on buildds. A drawback is that this exposes an internal
    implementation detail, and would not allow, say, changing the
    external tool, as the dependencies would stop being valid.

 2) Through a new dpkg-deb option to enable parallel compression,
    say --parallel-compression, so that dpkg-deb would try the
    parallel tool if it's available instead of the default
    implementation.

    This has almost the same requirements and drawbacks from the
    package maintainers' PoV as the -Zpixz option.

 3) Through a new environment variable, so that dpkg-deb would try the
    known tool if it's available instead of the default implementation.
    Say DPKG_DEB_COMPRESSOR=parallel or =pixz or similar.

    This would avoid any versioned Build-Depends on dpkg, as the tool
    would not fail on unknown options. But it would require any buildd
    wanting to benefit from this to set that explicitly, to make
    sure that the environment is not cleaned up, and to manually
    install the tools, which could vary depending on the compression
    used by the package, which is also an internal implementation
    detail that could also change. The alternative would be to change
    debian/rules to set the envvar and Build-Depend on the compressor.

    The semantics for =parallel and =pixz are going to be different:
    the former could complement -Zxz (or its omission), but the latter
    would get overridden by the -Z option. So either has pros/cons.
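
The difference between the =parallel and =pixz semantics in option 3
can be sketched roughly as follows, in Python rather than dpkg's C.
Nothing here exists in dpkg: the function choose_compressor, the
PARALLEL_TOOL table and the default are assumptions made only for
illustration, and DPKG_DEB_COMPRESSOR is the variable name floated
above, not an implemented interface.

```python
import shutil

# Hypothetical mapping from a dpkg-deb compressor name to a parallel tool.
PARALLEL_TOOL = {"xz": "pixz", "gzip": "pigz"}
# Assumed default when no -Z option is given.
DEFAULT_COMPRESSOR = "xz"


def choose_compressor(z_option, env, which=shutil.which):
    """Return the tool name to run for a given -Z value and environment.

    z_option is the value passed to -Z, or None if it was omitted.
    which is injectable so the lookup can be exercised without the tools.
    """
    requested = env.get("DPKG_DEB_COMPRESSOR")
    if requested == "parallel":
        # =parallel complements -Z: keep the format, swap the tool,
        # and fall back to the standard tool if the variant is missing.
        base = z_option or DEFAULT_COMPRESSOR
        tool = PARALLEL_TOOL.get(base)
        return tool if tool and which(tool) else base
    if z_option is not None:
        # An explicit -Z overrides a specific tool named in the envvar.
        return z_option
    if requested and which(requested):
        return requested
    return DEFAULT_COMPRESSOR
```

With pixz installed, choose_compressor("gzip", {"DPKG_DEB_COMPRESSOR":
"pixz"}) would still pick gzip, while =parallel with -Zgzip would pick
pigz if present.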

I presume any of the above options that automatically fall back to
the non-parallel implementation, or that change behavior depending on
environment variables, will make the reproducible builds people quite
unhappy, though.
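
One way to see why a silent fallback would matter for reproducibility:
two compressors can produce different bytes for the same input even
when the decompressed content is identical. The Python sketch below
uses a chunking gzip writer as a stand-in for "some parallel tool"; it
is not how pigz or pixz actually lay out their output, and the function
names and sizes are made up.

```python
import gzip


def gzip_serial(data):
    """One gzip member for the whole input (mtime pinned to 0)."""
    return gzip.compress(data, mtime=0)


def gzip_chunked(data, chunk_size):
    """One gzip member per chunk, concatenated (mtime pinned to 0)."""
    return b"".join(
        gzip.compress(data[i:i + chunk_size], mtime=0)
        for i in range(0, len(data), chunk_size)
    )


payload = b"dpkg control data\n" * 4096
serial = gzip_serial(payload)
chunked = gzip_chunked(payload, 16 * 1024)

# Both decompress to the same content, but the archives are not the same
# bytes, so a checksum over the resulting file would differ.
same_content = gzip.decompress(serial) == gzip.decompress(chunked)
same_bytes = serial == chunked
print("same content:", same_content)
print("same bytes:", same_bytes)
```

So if the tool used depends on what happens to be installed or on the
environment, the resulting .deb could differ between otherwise
identical builds.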


Hmm, it just occurred to me that another very viable temporary option
could be to implement minimal threaded xz and gzip (de)compressors as
small helper programs that would not make any use of libdpkg, and change
libdpkg to use those if requested. This seems like it would have most of
the benefits with very minor drawbacks.
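
As a rough idea of how small such a helper could be, here is a Python
sketch of a threaded xz compressor working on streams. It is only an
illustration under assumptions: the real helper would presumably be C
on top of liblzma, the names parallel_xz and read_blocks and the block
size are invented, and whether the threads actually run concurrently
depends on the runtime releasing its interpreter lock while
compressing. The output is a series of concatenated .xz streams.

```python
import io
import lzma
from concurrent.futures import ThreadPoolExecutor


def read_blocks(stream, block_size):
    """Yield successive blocks from a binary stream until EOF."""
    while True:
        block = stream.read(block_size)
        if not block:
            return
        yield block


def parallel_xz(src, dst, block_size=4 << 20, threads=4):
    """Compress src to dst as independent .xz streams, one per block."""
    wrote = False
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in input order, so the output is stable.
        # It also queues every block up front, which a real helper would
        # want to bound to keep memory usage in check.
        for compressed in pool.map(lzma.compress,
                                   read_blocks(src, block_size)):
            dst.write(compressed)
            wrote = True
    if not wrote:
        # Empty input still needs one valid (empty) stream.
        dst.write(lzma.compress(b""))


# Usage example with in-memory streams.
source = io.BytesIO(b"abc" * 100000)
target = io.BytesIO()
parallel_xz(source, target, block_size=65536, threads=3)
```

A standalone helper would call parallel_xz on its standard input and
output instead of the in-memory streams used here.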


So, all in all, my preferred deployment options in decreasing order,
would be:

 * Implement native parallel support in the compression libraries
   (might require adding thread-safety all over the dpkg codebase,
   depending on the implementation).
 * Implement native parallel support in libdpkg (requires first adding
   thread-safety all over the dpkg codebase).
 * Implement native parallel support in the standard compression
   tools, at least gzip and xz (probably unrealistic).
   Change dpkg to allow using the tools in addition to the libraries.
 * Implement native parallel compressor helpers in dpkg, to be used by
   libdpkg.
   Change dpkg to allow using the tools in addition to the libraries.
 * Change the compression tools to either use alternatives (!?) or
   modify the parallel variants to divert the standard ones (at least
   gzip is Essential, so this is probably off the table). Some might
   need wrapper scripts as they do not implement all command-line options.
   Change dpkg to allow using the tools in addition to the libraries.
 * Change dpkg to explicitly use one of the parallel variants, as
   detailed above.


Hope this sheds some light on the issues here.

Thanks,
Guillem

