Hi, "Joerg Schilling" wrote:
And "Tim Kientzle" wrote:
> You didn't specify what operating system you were using.
> Some operating systems do have support for locating holes,
> but it varies a lot. Joerg has done a lot of performance work
> on star, so it's entirely possible that he's optimized hole support
> for your OS and the GNU tar folks haven't (yet) done so.

Linux (2.6.31 here currently), so the difference isn't due to star
being able to determine where the holes are and avoid reading through
the entire file. Perhaps star just has more optimised hole support
and/or file I/O in general?

>> ma...@clara.co.uk wrote:
>>
>> In future, the tar file format could be updated to allow sparse
>> files to be archived in a single pass, but it would require ...
>
> I've considered approaches like this for libarchive, but
> I haven't found the time to experiment with them.
>
> Specifically, this could be done without seeking
> (and without completely ignoring the standards)
> by recording a complete tar entry for each "packet" of a file
> (where a packet encodes a hole and a block of data
> following the hole). GNU tar does something conceptually
> similar for its multi-volume support: it writes a single file as
> multiple entries in the archive. Using pax extensions,
> the overhead for this approach would be 1536+ bytes
> for each "packet", or about 0.015% with 10MB packets.

Neat. Could that approach automatically work with existing tar
implementations?

I thought of something similar shortly after sending my original
message. Since I know very little about the tar file format, some of
this could probably be done in a better/neater way...

A sparse file could be represented in the archive like this: the
file's apparent length, followed by one or more "chunk" entries, each
consisting of:

- chunk type (see below)
- length
- chunk data (if applicable)

There could be several chunk types (a rough sketch of a possible
encoding appears after this list):

1. hole. Represents a hole. The following length is the hole size.
   No chunk data.

2. written zeros. Represents a run of written all-zero data. The
   following length is the size of the run. No chunk data.

3. data. The following length is the size of the chunk data. (For
   non-zero runs larger than the buffer size, consecutive data chunks
   would be written to the archive.)

4. EOF marker. Put after the last chunk so tar knows it has reached
   the end of the file data.

The distinction between holes and written zeros could be used to
preserve the exact sparseness of files (assuming OS support for
finding holes), without bloating the archive by storing written
all-zero data as-is. On extraction the user could choose whether to
preserve exact sparseness or maximise sparseness (in which case tar
would treat "written all-zero" chunks the same as holes).
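To make that concrete, a chunk header might look something like this.
It's purely illustrative; the field names and sizes are my assumptions,
not part of any tar standard (type 5 is described just below):

#include <stdint.h>

enum chunk_type {
    CHUNK_HOLE   = 1,  /* length = hole size; no data follows */
    CHUNK_ZEROS  = 2,  /* length = size of written all-zero run */
    CHUNK_DATA   = 3,  /* length = size of the data that follows */
    CHUNK_EOF    = 4,  /* end of file data; length = 0 */
    CHUNK_REPEAT = 5   /* repeated run; see below */
};

struct chunk_header {
    uint8_t  type;     /* one of enum chunk_type */
    uint64_t length;   /* interpretation depends on type */
    /* CHUNK_DATA: 'length' bytes of file data follow.
     * CHUNK_REPEAT: a repeat count and the run data follow. */
};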
That approach could be extended to cover other repeated values. For
example, some types of media contain the same (non-zero) byte/bytes
repeated when freshly formatted. If the archiver detected those
repeated blocks, it could encode runs using another chunk type:

5. repeated run. Represents a repeated run of data. The following
   length is the run length, then the number of repeats, then the run
   data.

Talking of sparse files... or rather, not. I'm not talking about
archiving sparse files now, but about tar's operation when it creates
an archive and the source file(s) contain all-zero data.

If the archive is a file (i.e. seekable), the disk space occupied by
the archive could be minimised if tar could write the archive
sparsely. That is, seek forward instead of writing all-zero data. So
in that case *the tar archive itself* would be sparse. That would be
most useful when disk I/O bandwidth is limited, e.g. writing to a USB
hard disk. (And there would be no overhead on creation from having to
read source files twice, since --sparse isn't used.)

At the expense of some performance, that can be achieved today using a
program which takes its input and "sparsifies" it. One such program is
ddpt; see http://sg.danny.cz/sg/ddpt.html

An example... Create a non-sparse 10MB file:

$ dd if=/dev/zero of=10MB.bin bs=1048576 count=10

Archive it, piping the archive to ddpt, which will write it sparsely:

$ tar -c 10MB.bin | ddpt if=- of=10MB.tar bs=512 bpt=128,1 oflag=sparse

The result:

$ du --bytes 10MB.tar ; du --block-size=1 10MB.tar
10496000        10MB.tar
8192            10MB.tar

The same approach can be used when e.g. decompressing a
bzip2-compressed file: you can pipe the output through such a program
to make it sparse, since bzip2 et al don't support sparse writing
themselves. (The new XZ Utils *does* support sparse writing, btw.)
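The core of the "sparsify" technique is small enough to sketch. This is
a toy version, not how ddpt actually does it, and error handling is
omitted: read fixed-size blocks, seek over the all-zero ones instead of
writing them, and ftruncate() at the end in case the output ends in a
hole.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 65536

/* True if the whole buffer is zero: the first byte is zero and the
 * buffer equals itself shifted by one byte. */
static int all_zero(const char *buf, ssize_t n)
{
    return buf[0] == 0 && memcmp(buf, buf + 1, n - 1) == 0;
}

int main(int argc, char **argv)
{
    static char buf[BLKSZ];
    off_t pos = 0;
    ssize_t n;
    int fd;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);

    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
        if (n > 1 && all_zero(buf, n))
            lseek(fd, n, SEEK_CUR);   /* leave a hole */
        else
            write(fd, buf, n);
        pos += n;
    }
    ftruncate(fd, pos);  /* extend the file if it ends in a hole */
    close(fd);
    return 0;
}

Used like the ddpt example above (the program name "sparsify" is just
what I'd call it):

$ tar -c 10MB.bin | ./sparsify 10MB.tar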