On Jun 8, 2009, at 12:55 AM, Joe Landman wrote:
Lawrence Stewart wrote:
[...]
Yup, this is it, but on the fly is the hard part. Doing this
comparison is computationally very expensive. The hash
calculations are not cheap by any measure. You most decidedly do
not wish to do this on the fly ...
The assumption of a high performance disk/file system is implicit
here.
And for that, it's all about the trade-off among storage space,
retrieval time, and the computational effort to run the algorithm.
Exactly.
I think the hash calculations are pretty cheap, actually. I just
timed sha1sum on a 2.4 GHz core2 and it runs at 148 Megabytes per
second, on one core (from the disk cache). That is substantially
faster than the disk transfer rate. If you have a parallel
filesystem, you can parallelize the hashes as well.
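
(For reference, here is roughly how such a measurement can be
reproduced; this is a Python stand-in for the sha1sum run described
above, and the 256 MiB in-memory buffer is my choice for illustration,
not the exact test that was run.)

    import hashlib
    import os
    import time

    # Hash an in-memory buffer so we measure the hash itself, not the
    # disk (the sha1sum test above read from the disk cache for the
    # same reason).
    buf = os.urandom(256 * 1024 * 1024)   # 256 MiB of random data

    t0 = time.time()
    hashlib.sha1(buf).hexdigest()
    elapsed = time.time() - t0

    print("SHA-1 throughput: %.1f MB/s" % (len(buf) / elapsed / 1e6))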
Disk transfer rates are 100-120 MB/s these days. For high
performance local file systems (N * disks), the data rates you
showed for sha1 won't cut it, especially if multiple hash signatures
are computed in order to avoid hash collisions (a la MD5 et al.).
The probability that two different hash functions both collide on the
same pair of distinct blocks is somewhat smaller than the probability
that a single hash function collides on some pair of distinct blocks,
so you compute two or more hashes in different ways to reduce the
probability of a collision.
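
(Something like the following, I take it; a minimal sketch of
computing two signatures per block, where the 4 KiB block size and the
SHA-1 plus MD5 pairing are illustrative choices, not a claim about any
particular product.)

    import hashlib

    BLOCK = 4096   # 4 KiB blocks, chosen for illustration

    def block_signature(block: bytes) -> bytes:
        # Concatenate two different digests; a false duplicate now
        # requires SHA-1 and MD5 to collide on the same pair of blocks.
        return hashlib.sha1(block).digest() + hashlib.md5(block).digest()

    sig_a = block_signature(b"\x00" * BLOCK)
    sig_b = block_signature(b"\x01" * BLOCK)
    print(len(sig_a), "bytes per signature;",
          "distinct" if sig_a != sig_b else "collision")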
For laughs, I just tried this on our lab server. We have a 500 MB/s
file system attached to it, so we aren't so concerned about data IO
bottlenecks.
Actual measurements and careful analysis beat handwaving every time :-)
My assumptions were a bit different, I guess: I was still figuring
50 MB/s per spindle, and supposing that <clients> compute the hashes,
rather than the servers running the disks. If that could be arranged,
then the cores available to compute hashes scale with the number of
clients.
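
(A toy version of that arrangement, with a local process pool standing
in for the clients; the 4 KiB block size and the pool layout are made
up for illustration.)

    import hashlib
    from multiprocessing import Pool

    BLOCK = 4096   # 4 KiB blocks, chosen for illustration

    def hash_block(block: bytes) -> str:
        return hashlib.sha1(block).hexdigest()

    if __name__ == "__main__":
        # Each worker process plays the role of one client hashing the
        # blocks it is about to write, so hash throughput scales with
        # the number of clients rather than with the file servers.
        blocks = [bytes([i % 256]) * BLOCK for i in range(1000)]
        with Pool() as pool:
            digests = pool.map(hash_block, blocks)
        print(len(set(digests)), "distinct blocks out of", len(blocks))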
In any case, I am <not> arguing for deduplication for high performance
filesystems; I do think it is a decent idea for backups, though.
Regarding hash collisions, the relevant math is the "birthday problem"
I think. If the hash values are uniformly distributed, as they should
be, then the probability of a collision rises to about 1/2 when the
number of blocks reaches the square root of the size of the value
space. So you would have about a 50% chance of a collision <somewhere>
if you have 4 billion blocks (32 bits) and are using 64 bit hashes.
If multiple hashes are independent, then you get to add their sizes
(in bits) before taking the square root. 256 bit hashes ought to give
negligible odds of a collision up to 64 bits worth of blocks, where
"negligible" means much less than other sources of permanently lost
data.
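
(The back-of-the-envelope version of those numbers, using the usual
birthday approximation p ~= n^2 / 2^(b+1) for n blocks and a b-bit
hash; nothing here beyond that formula.)

    # Birthday-bound estimate of a collision somewhere among n blocks
    # hashed to b bits: p ~= n^2 / 2^(b+1)  (a slight overestimate as
    # p approaches 1).
    def collision_prob(n_blocks, hash_bits):
        return min(1.0, n_blocks ** 2 / 2.0 ** (hash_bits + 1))

    print(collision_prob(2 ** 32, 64))    # ~0.5: 4 billion blocks, 64 bit hash
    print(collision_prob(2 ** 64, 256))   # ~1e-39: 2^64 blocks, 256 bit hash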
However, it might make more sense from a system design perspective to
use the hash as a hint, and to actually compare the data. This would
force a random block read to confirm every duplicate. Hmm, let's
convert sequential writes into random reads...
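
(In sketch form, what I mean by using the hash as a hint; the
in-memory index and list-backed block store are stand-ins invented for
illustration, and the point is the extra read on every duplicate hit.)

    import hashlib

    index = {}   # digest -> offset of a previously stored block
    store = []   # stand-in for the block store on disk

    def write_block(block: bytes) -> int:
        h = hashlib.sha1(block).digest()
        if h in index:
            offset = index[h]
            # The hash match is only a hint: re-read the candidate
            # block and byte-compare it.  On a real system this is the
            # random read that every duplicate write has to pay for.
            if store[offset] == block:
                return offset            # genuine duplicate, nothing written
        store.append(block)              # new block (or a hash collision)
        index[h] = len(store) - 1
        return len(store) - 1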
The compression ratio of 4K blocks to 16 byte hashes is also suspect:
this is 256 to one, and the incremental cost ratio of disk and RAM is
not much different. So keeping the hashes in RAM is probably too
expensive.
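
Concretely, one 16 byte hash per 4 KiB block works out to roughly 4 GB
of RAM per terabyte of deduplicated disk (a quick sketch of the
arithmetic, using the block and hash sizes above):

    TB = 10 ** 12
    BLOCK_SIZE = 4096   # 4 KiB blocks
    HASH_SIZE = 16      # 16 byte hash per block

    def ram_for_hashes(disk_bytes):
        # RAM needed to keep one hash per block resident in memory.
        return disk_bytes // BLOCK_SIZE * HASH_SIZE

    for tb in (1, 10, 100):
        print("%4d TB of disk -> %6.1f GB of hashes" %
              (tb, ram_for_hashes(tb * TB) / 1e9))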
So in summary, deduplication is messy and complicated, has bad
performance, and has uncertain economics for HPC. Let's not do it.
-L
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf