An interesting thing about the output tarballs from my script: if I
rdiff two of them, one tarball plus the patch file is significantly
smaller than the two tarballs together (presumably because diffs on
different days are nonetheless similar).* This is probably very
dependent on what kind of data is being backed up, but it may lead to
a way to make increment storage even more efficient (but also more
fragile, since a restore would take two levels of merging). It's also
very possible that this is a clear indication that I've done something
very wrong in my script that's causing duplicate data in what are
supposed to be separate increments. Further testing is required ;)
*Example from my test set: a collected increment from 2008-10-04 is
49MB, and the one from 2008-10-05 is also 49MB (total 98MB). An rdiff
delta file to turn 2008-10-04 into 2008-10-05 is only 18MB, so
2008-10-04 plus the delta file is 67MB. Another delta to turn
2008-10-05 into 2008-10-06 is also only 18MB, so the three of them
together are 85MB instead of 147MB. Again, this is probably highly
dependent on the kind of data that's in these increments, but I'm
surprised it works as well as it does given that I'm tarring some
already-gzipped files together.
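As a quick check of the arithmetic in the footnote, here is a minimal
Python sketch. The function names are made up for illustration, and it
assumes every full increment is the same size, which happens to be true
of the 49MB examples above but won't be in general.

```python
def chained_storage_mb(full_mb, delta_sizes_mb):
    """Space for one full increment plus a chain of rdiff deltas (MB)."""
    return full_mb + sum(delta_sizes_mb)


def separate_storage_mb(full_mb, count):
    """Space for `count` full increments stored side by side (MB).

    Assumption: all increments are the same size.
    """
    return full_mb * count


# Figures from the test set: 49MB increments, 18MB deltas.
for deltas in ([18], [18, 18]):
    count = len(deltas) + 1
    print(count, "increments:",
          chained_storage_mb(49, deltas), "MB chained vs",
          separate_storage_mb(49, count), "MB separate")
```

The catch is the one already mentioned: restoring the last increment in
a chain means applying every delta before it.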
I've been doing some more experimenting, and I've found a partial
explanation for this. Mirror metadata files are huge! In my case each
one is 88MB (uncompressed), though they're only 6MB when gzipped. There
are only minor differences between consecutive ones (a diff patch from
one to the next is on the order of 61KB uncompressed, 11KB compressed;
an rdiff patch is considerably larger, since the files are plain text),
so in my example above that explains some of the saved space. File
statistics files probably also don't change much, but they too don't
account for much when compressed (about 5MB apiece).
In round 2 of testing, I tried uncompressing all of the files in an
increment, and then re-storing that as a tar, then generating the rdiff
delta, and then recompressing everything. This yielded a very slight
advantage in compression, but a significant one in rdiff'ing -- the
rdiff deltas are down from 18MB to only 7.4MB (and as I said, some of
that can be explained away by similarities in mirror metadata files).
That leaves, of a 49MB increment: 7.4MB of data that's different + 5MB
of nearly-identical file statistics + 6.1MB of nearly-identical mirror
metadata + another 30.5MB of data that must be identical between the two
increments.
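For anyone who wants to repeat the round-2 experiment, the "uncompress
everything and re-store it as a tar" step could look roughly like the
Python sketch below. This is not my actual script: the function name is
hypothetical, it assumes compressed members can be recognised by a .gz
suffix, and it reads each member fully into memory, which is fine for a
test but not for large increments.

```python
import gzip
import io
import tarfile


def retar_uncompressed(src_tar_path, dst_tar_path):
    """Copy a tar archive, gunzipping every regular member named *.gz.

    Assumption: a .gz suffix is a reliable sign of a gzipped member.
    """
    with tarfile.open(src_tar_path, "r") as src, \
            tarfile.open(dst_tar_path, "w") as dst:
        for member in src:
            if not member.isfile():
                # Directories, symlinks etc. carry no data; copy the header.
                dst.addfile(member)
                continue
            data = src.extractfile(member).read()
            if member.name.endswith(".gz"):
                data = gzip.decompress(data)
                # Build a fresh header for the uncompressed payload.
                info = tarfile.TarInfo(name=member.name[:-3])
                info.size = len(data)
                info.mtime = member.mtime
                info.mode = member.mode
                dst.addfile(info, io.BytesIO(data))
            else:
                dst.addfile(member, io.BytesIO(data))
```

The resulting uncompressed tarballs are what get fed to rdiff; they can
be compressed again afterwards along with the delta.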
This leads me to suspect that rdiff-backup is storing snapshots of
things that it shouldn't. Even if rdiff-backup routinely stores
snapshots every 10 times a file changes (as was mentioned earlier), I
find it unlikely that this would coincidentally happen to enough files
on 7 consecutive backup runs (I've run this experiment on 7 adjacent
pairs of increments and get similar numbers for all of them) to get the
kind of numbers I'm getting.
Another possibility is that these overlaps can be explained as file
moves. Currently I think rdiff-backup cannot detect a file move, and
stores it as a deletion plus a new file; correct? If so, then perhaps
what's happening here is that part of the backup data set includes
daily-rotated logfiles. Rdiff can detect the identical blocks, because
when I'm using it on tarballs of the entire increment, all of the data
is in one file. Supposing the rotating logs keep 10 files, then
rdiff-backup is seeing 10 files change so drastically that it's cheaper
to store snapshots, but rdiff sees 10 large blocks of identical data
that just happen to have moved down by a unit or two in the tarball.
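To illustrate the rotated-logfile theory, here is a small Python
simulation. It is only a sketch under simplifying assumptions: the logs
are random data, each one is a whole number of blocks long, and blocks
are compared at fixed offsets with a plain hash. Real rdiff uses a
rolling checksum and does not need that alignment, and a real tarball
also contains headers and padding. All names and sizes are made up.

```python
import hashlib
import os

BLOCK = 4096            # hypothetical block size for this sketch
LOG_SIZE = 16 * BLOCK   # each log is a whole number of blocks


def make_log():
    """Stand-in for one day's logfile: random, incompressible data."""
    return os.urandom(LOG_SIZE)


def rotate(logs, new_log):
    """logs[0] is log.1 (newest); rotation drops the oldest log."""
    return [new_log] + logs[:-1]


def split_blocks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def changed_files(old_logs, new_logs):
    """Per-file view: how many log.N names now hold different content."""
    return sum(1 for a, b in zip(old_logs, new_logs) if a != b)


def shared_fraction(old_logs, new_logs):
    """Whole-archive view: fraction of new blocks already in the old blob."""
    old = {hashlib.sha256(b).digest()
           for b in split_blocks(b"".join(old_logs))}
    new_blocks = split_blocks(b"".join(new_logs))
    hits = sum(1 for b in new_blocks if hashlib.sha256(b).digest() in old)
    return hits / len(new_blocks)


day1 = [make_log() for _ in range(10)]
day2 = rotate(day1, make_log())
print("files changed:", changed_files(day1, day2))
print("blocks shared:", shared_fraction(day1, day2))
```

With ten logs kept, the per-file comparison reports every file as
changed, while about nine tenths of the blocks in the second day's blob
already exist in the first day's. That is the same kind of gap I'm
seeing between the increments and the deltas.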
So, perhaps my harebrained original suggestion of storing increments as
single files has led to a relatively easy way to implement file move
detection? (I'll be the first to point out, though, that since it
requires tarballs to work from, it's not particularly efficient to
create even if it is efficient to store once it's done. There might be a
better way of implementing this same idea, though.)
~Felix.