Hi, "Joerg Schilling" wrote:
And "Tim Kientzle" wrote:
> You didn't specify what operating system you were using.
> Some operating systems do have support for locating holes,
> but it varies a lot. Joerg has done a lot of performance work
> on star, so it's entirely possible that he's optimized hole support
> for your OS and the GNU tar folks haven't (yet) done so.

Linux (2.6.31 here currently), so the difference isn't due to star
being able to determine where the holes are and avoid reading through
the entire file. Perhaps star just has more optimised hole support
and/or file I/O in general?

>> ma...@clara.co.uk wrote:
>>
>> In future, the tar file format could be updated to allow sparse
>> files to be archived in a single pass, but it would require ...
>
> I've considered approaches like this for libarchive, but
> I haven't found the time to experiment with them.
>
> Specifically, this could be done without seeking
> (and without completely ignoring the standards)
> by recording a complete tar entry for each "packet" of a file
> (where a packet encodes a hole and a block of data
> following the hole). GNU tar does something conceptually
> similar for its multi-volume support: it writes a single file as
> multiple entries in the archive. Using pax extensions,
> the overhead for this approach would be 1536+ bytes
> for each "packet", or about 0.015% with 10MB packets.

Neat. Could that approach automatically work with existing tar
implementations?

I thought of something similar shortly after sending my original
message. Since I know very little about the tar file format, some of
this could probably be done in a better/neater way...

A sparse file could be represented in the archive like this: the
file's apparent length, followed by one or more "chunk" entries, each
consisting of:

- chunk type (see below)
- length
- chunk data (if applicable)

There could be several chunk types (a rough sketch of a possible
encoding appears after this list):

1. hole. Represents a hole. The following length is the hole size.
   No chunk data.

2. written zeros. Represents a run of written all-zero data. The
   following length is the size of the run. No chunk data.

3. data. The following length is the size of the chunk data. (For
   non-zero runs larger than the buffer size, consecutive data chunks
   would be written to the archive.)

4. EOF marker. Put after the last chunk so tar knows it has reached
   the end of the file data.

The distinction between holes and written zeros could be used to
preserve the exact sparseness of files (assuming OS support for
finding holes), without bloating the archive by storing written
all-zero data as-is. On extraction the user could choose whether to
preserve exact sparseness or maximise sparseness (in which case tar
would treat "written all-zero" chunks the same as holes).
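To make that concrete, a chunk header might look something like this.
It's purely illustrative; the field names and sizes are my assumptions,
not part of any tar standard (type 5 is described just below):

#include <stdint.h>

enum chunk_type {
    CHUNK_HOLE   = 1,  /* length = hole size; no data follows */
    CHUNK_ZEROS  = 2,  /* length = size of written all-zero run */
    CHUNK_DATA   = 3,  /* length = size of the data that follows */
    CHUNK_EOF    = 4,  /* end of file data; length = 0 */
    CHUNK_REPEAT = 5   /* repeated run; see below */
};

struct chunk_header {
    uint8_t  type;     /* one of enum chunk_type */
    uint64_t length;   /* interpretation depends on type */
    /* CHUNK_DATA: 'length' bytes of file data follow.
     * CHUNK_REPEAT: a repeat count and the run data follow. */
};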
That approach could be extended to cover other repeated values. For
example, some types of media contain the same (non-zero) byte/bytes
repeated when freshly formatted. If the archiver detected those
repeated blocks, it could encode runs using another chunk type:

5. repeated run. Represents a repeated run of data. The following
   length is the run length, then the number of repeats, then the run
   data.

Talking of sparse files... or rather, not. I'm not talking about
archiving sparse files now, but about tar's operation when it creates
an archive and the source file(s) contain all-zero data.

If the archive is a file (i.e. seekable), the disk space occupied by
the archive could be minimised if tar could write the archive
sparsely. That is, seek forward instead of writing all-zero data. So
in that case *the tar archive itself* would be sparse. That would be
most useful when disk I/O bandwidth is limited, e.g. writing to a USB
hard disk. (And there would be no overhead on creation from having to
read source files twice, since --sparse isn't used.)

At the expense of some performance, that can be achieved today using a
program which takes its input and "sparsifies" it. One such program is
ddpt; see http://sg.danny.cz/sg/ddpt.html

An example... Create a non-sparse 10MB file:

$ dd if=/dev/zero of=10MB.bin bs=1048576 count=10

Archive it, piping the archive to ddpt, which will write it sparsely:

$ tar -c 10MB.bin | ddpt if=- of=10MB.tar bs=512 bpt=128,1 oflag=sparse

The result:

$ du --bytes 10MB.tar ; du --block-size=1 10MB.tar
10496000        10MB.tar
8192            10MB.tar

The same approach can be used when e.g. decompressing a
bzip2-compressed file: you can pipe the output through such a program
to make it sparse, since bzip2 et al don't support sparse writing
themselves. (The new XZ Utils *does* support sparse writing, btw.)
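The core of the "sparsify" technique is small enough to sketch. This is
a toy version, not how ddpt actually does it, and error handling is
omitted: read fixed-size blocks, seek over the all-zero ones instead of
writing them, and ftruncate() at the end in case the output ends in a
hole.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 65536

/* True if the whole buffer is zero: the first byte is zero and the
 * buffer equals itself shifted by one byte. */
static int all_zero(const char *buf, ssize_t n)
{
    return buf[0] == 0 && memcmp(buf, buf + 1, n - 1) == 0;
}

int main(int argc, char **argv)
{
    static char buf[BLKSZ];
    off_t pos = 0;
    ssize_t n;
    int fd;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);

    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
        if (n > 1 && all_zero(buf, n))
            lseek(fd, n, SEEK_CUR);   /* leave a hole */
        else
            write(fd, buf, n);
        pos += n;
    }
    ftruncate(fd, pos);  /* extend the file if it ends in a hole */
    close(fd);
    return 0;
}

Used like the ddpt example above (the program name "sparsify" is just
what I'd call it):

$ tar -c 10MB.bin | ./sparsify 10MB.tar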