On 27/05/2015 12:44, Jérémy Bobbio wrote:
Hi!
We are working in Debian (and I know other free software projects
care) on providing our users with a way to reproduce bit-for-bit
identical binary packages from the source and build environment.
See <https://wiki.debian.org/ReproducibleBuilds/About> for some
rationale and further explanations.
In order to do this, we need to make our build processes as
deterministic as possible. As you can imagine, Tar is quite involved in
producing Debian packages. A straightforward call leads to multiple
issues:
* Order of files in the archive will depend on the filesystem order.
* User and group names are recorded. This can be seen as a privacy leak
for the package builder.
* Permissions are dependent on the builder umask.
* Last modification times of archive members for files created during
the build will depend on the build time.
* Also, if gzip compression is used, a timestamp will be recorded in
the gzip header.
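The gzip point is easy to check by hand. What follows is only a minimal sketch, assuming a POSIX shell with gzip, cmp and mktemp available; the file name `f` and the variable names are made up for the illustration. It compresses the same content under two different modification times, once with default options and once with `-n`.

```shell
#!/bin/sh
# Sketch: gzip records the input file's mtime (and name) unless -n is given.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/f"

touch -t 200101010000 "$tmp/f"      # pretend the file was built in 2001
gzip -c  "$tmp/f" > "$tmp/a.gz"     # default: header carries name and mtime
gzip -nc "$tmp/f" > "$tmp/an.gz"    # -n: neither name nor timestamp stored

touch -t 200202020000 "$tmp/f"      # same content, "rebuilt" in 2002
gzip -c  "$tmp/f" > "$tmp/b.gz"
gzip -nc "$tmp/f" > "$tmp/bn.gz"

# Compare the two default outputs, then the two -n outputs.
if cmp -s "$tmp/a.gz" "$tmp/b.gz"; then default_result=same; else default_result=differ; fi
if cmp -s "$tmp/an.gz" "$tmp/bn.gz"; then n_result=same; else n_result=differ; fi
echo "default: $default_result, with -n: $n_result"
rm -rf "$tmp"
```

With the gzip versions I know of, this should report that the default outputs differ while the `-n` outputs are identical.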
So, we are currently turning calls like:
    tar -zcf archive.tar.gz src
into:
    find src -print0 | LC_ALL=C sort -z |
      GZIP=-9n tar --null -T - --no-recursion \
        --owner=root --group=root --numeric-owner \
        --mode=go=rX,u+rw,a-s \
        --mtime=debian/changelog \
        -zcf archive.tar
It would be great to avoid at least some of the boilerplate. Finding a
generic solution for permissions and modification times might be too
much, but having a `--deterministic` flag for the rest of the issues
would be quite helpful already.
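Until such a flag exists, the boilerplate can at least be hidden in a shell function. The following is only a sketch under a few assumptions: GNU find, sort and tar are installed, `det_tar` is a made-up helper name, and the modification time is passed as seconds since the epoch rather than taken from a file. It also pins `--format=gnu` and pipes into `gzip -9n` instead of using the GZIP variable; both are my choices, not part of the command above.

```shell
#!/bin/sh
# det_tar OUT.tar.gz DIR EPOCH
# Hypothetical helper (not part of tar): archive DIR with sorted member
# order, root ownership, normalised modes and one fixed mtime.
det_tar() {
    out=$1; dir=$2; epoch=$3
    # NUL-separated, byte-order sorted file list, independent of the
    # filesystem order and of the builder's locale.
    find "$dir" -print0 | LC_ALL=C sort -z |
    tar --null -T - --no-recursion \
        --format=gnu \
        --owner=0 --group=0 --numeric-owner \
        --mode=go=rX,u+rw,a-s \
        --mtime="@$epoch" \
        -cf - |
    gzip -9n > "$out"    # -n keeps name and timestamp out of the gzip header
}
```

A call could look like `det_tar archive.tar.gz src "$(stat -c %Y debian/changelog)"`, which assumes GNU stat for reading the changelog's modification time.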
What do you think?
Agree in principle. Note that the boilerplate you
show looks like it doesn't handle:
- Creation/Access times (if stored in tar headers).
- Random gzip version dependencies (also affects DAK
producing different gzipped index files depending on
the Debian release installed on/near master).
- statoverride integration for suid/sgid binaries and
special dir flags (mostly in basefiles and /usr/local).
- Adding .gz extension to archive.tar (probably just
a typo).
Which probably makes the real command line even longer.
Also, at least a few versions back, dpkg-source
produced the wrong file timestamps in .diff.gz
files, affecting the consistency of source file
timestamps.
Now for tar, I would suggest (as a future feature) three
new determinism options:
--nomode  : Short for --owner=root --group=root
            --numeric-owner --mode=go=rX,u+rw, except
            for suid/sgid entries. Combine with
            --mode=a-s to make all files root:root with
            no suid/sgid bits.
            For more advanced permission systems (ACLs
            etc.) --nomode will in general archive each
            entry as if all non-modify permissions are
            the union of those granted to any users, while
            modify permissions are for owner only and any
            special attributes (sgid/suid/capabilities
            etc.) are preserved.
--sort    : Causes the entries in each processed
            directory to be output in Asciibetical order
            (thus each dir needs to be loaded into memory
            and sorted, using a locale-independent
            strcmp() variant, but no need to preload
            entire file listing).
--onepass : (not for package builders): If a file
            changes while being archived, the archived
            file contents, file length and sparse holes
            will all be determined from a single read()
            pass over the file until end of file is
            reached. This is in contrast to the current
            two-pass logic where length and holes are
            found on a first pass, contents of non-holes
            on a second pass. Thus --onepass provides
            guarantees to applications (such as
            databases) that a restored file will have
            the property that if something in the file
            indicates that something earlier in the file
            was updated to checkpoint X, then that will
            be true, just as if the backup had been done
            with cat.
            The kernel/filesystem is responsible for
            presenting a consistent view of each file to
            all processes/handles (a property already
            needed for ordinary interprocess use of a
            shared file).
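One detail worth noting about --sort: sorting within each directory is not the same order as sorting the full path list the way the find | sort pipeline does. A small sketch, assuming a POSIX shell with sort and paste; the path names are invented for the example:

```shell
#!/bin/sh
# Byte-wise ("Asciibetical") sort of full paths, as done by LC_ALL=C sort.
# '-' (0x2D) sorts before '/' (0x2F), so src/a-b lands between the
# directory src/a and its own member src/a/x.
sorted=$(printf '%s\n' src/a/x src/a-b src/B src/a |
         LC_ALL=C sort | paste -s -d ' ' -)
echo "$sorted"    # prints: src/B src/a src/a-b src/a/x
```

A per-directory sort, as I read the --sort description, would instead emit src/a, then src/a/x, then src/a-b. Both orders are deterministic, they are just not interchangeable, so a given archive format or toolchain would have to settle on one of them.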
Enjoy
Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded