On Jun 8, 2009, at 12:55 AM, Joe Landman wrote:

Lawrence Stewart wrote:

[...]

Yup, this is it, but on the fly is the hard part. Doing this comparison is computationally very expensive. The hash calculations are not cheap by any measure. You most decidedly do not wish to do this on the fly ...

The assumption of a high performance disk/file system is implicit here.


And for that, it’s all about trading off the cost of storage space against retrieval time and the computational effort of running the algorithm.

Exactly.
I think the hash calculations are pretty cheap, actually. I just timed sha1sum on a 2.4 GHz Core 2 and it runs at 148 megabytes per second on one core (from the disk cache). That is substantially faster than the disk transfer rate. If you have a parallel filesystem, you can parallelize the hashes as well.

Disk transfer rates are 100-120 MB/s these days. For high performance local file systems (N * disks), the data rates you showed for sha1 won't cut it, especially if multiple hash signatures are computed in order to avoid hash collisions (a la MD5 et al.). The probability that two different hash functions both map the same pair of distinct blocks to colliding values is somewhat smaller than the probability that a single hash function has two (or more) distinct blocks that collide. So you compute two or more hashes in different ways, to reduce the probability of that collision.

For laughs, I just tried this on our lab server. We have a 500 MB/s file system attached to it, so we aren't so concerned about data IO bottlenecks.


Actual measurements and careful analysis beat handwaving every time :-)
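
For anyone who wants to reproduce that sort of number, here is a rough Python sketch of the same measurement: single-core hash throughput over a buffer already in memory, i.e. no disk I/O. The buffer size and repeat count are arbitrary, and hashlib will typically run a bit slower than the C sha1sum binary.

# Measure single-core SHA-1 throughput over an in-memory buffer,
# roughly what timing sha1sum against a file in the page cache tells you.
import hashlib
import time

buf = b"\0" * (64 * 1024 * 1024)    # 64 MB test buffer
reps = 8

start = time.time()
for _ in range(reps):
    hashlib.sha1(buf).digest()
elapsed = time.time() - start

print("SHA-1: %.0f MB/s on one core" % (len(buf) * reps / 1e6 / elapsed))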

My assumptions were a bit different, I guess: I was still figuring 50 MB/s per spindle, and supposing that <clients> compute the hashes rather than the servers running the disks. If that could be arranged, then the cores available to compute hashes scale with the number of clients.

In any case, I am <not> arguing for deduplication for high performance filesystems; I do think it is a decent idea for backups, though.

Regarding hash collisions, the relevant math is the "birthday problem", I think. If the hash values are uniformly distributed, as they should be, then the probability of a collision rises to about 1/2 when the number of blocks reaches the square root of the size of the value space. So you would have about a 50% chance of a collision <somewhere> if you have 4 billion blocks (32 bits) and are using 64-bit hashes. If multiple hashes are independent, then you get to add the sizes before taking the square root. 256-bit hashes ought to give negligible odds of a collision up to 64 bits worth of blocks, where "negligible" means much less than other sources of permanently lost data.
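
The arithmetic is easy to check with the usual birthday approximation, p ~= 1 - exp(-n^2 / 2^(b+1)) for n blocks and b-bit hashes. Quick Python sketch, plugging in the block counts and hash widths from the paragraph above:

# Birthday-bound collision probability for n uniformly distributed
# b-bit hash values: p ~= 1 - exp(-n^2 / 2^(b+1)).
import math

def collision_prob(n_blocks, hash_bits):
    return 1.0 - math.exp(-(n_blocks ** 2) / (2.0 * 2.0 ** hash_bits))

print(collision_prob(2 ** 32, 64))     # 4 billion blocks, 64-bit hash: ~0.39
print(collision_prob(2 ** 64, 256))    # 2^64 blocks, 256-bit hash: ~1e-39, prints 0.0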

However, it might make more sense from a system design perspective to use the hash as a hint and to actually compare the data. This would force a random block read to confirm every duplicate. Hmm, let's convert sequential writes into random reads...
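
Here is a toy in-memory sketch of that hash-as-hint scheme (the names and structure are made up; the point is just that every apparent duplicate costs a read-back and byte compare, which on disk would be a random read):

# Toy dedup store: the index maps hash -> block address, but a hash hit
# is only trusted after reading the stored block back and comparing the
# bytes.  In a real system that read-back is the extra random read.
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = []      # stand-in for blocks on disk
        self.index = {}       # SHA-1 digest -> block address

    def write(self, data):
        h = hashlib.sha1(data).digest()
        addr = self.index.get(h)
        if addr is not None and self.blocks[addr] == data:
            return addr       # verified duplicate: reference the old block
        self.blocks.append(data)          # new block (or a hash collision)
        self.index[h] = len(self.blocks) - 1
        return self.index[h]

store = DedupStore()
a = store.write(b"x" * 4096)
b = store.write(b"x" * 4096)              # deduplicated against the first write
print(a, b, len(store.blocks))            # -> 0 0 1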

The compression ratio of 4K blocks to 16-byte hashes is also suspect: that is 256 to one, and the incremental cost ratio of disk to RAM is not much different, so keeping the hashes in RAM is probably too expensive.
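
To put a number on it, the index footprint per terabyte of disk works out as below (16-byte hashes over 4 KB blocks; per-entry pointers and table overhead ignored):

# RAM needed for the hash index per TB of deduplicated disk, at 16 bytes
# of hash per 4 KB block (pointers and table overhead not counted).
disk_bytes = 10 ** 12
block_size = 4096
hash_bytes = 16

index_bytes = disk_bytes // block_size * hash_bytes
print("%.1f GB of index per TB of disk" % (index_bytes / 1e9))   # ~3.9 GB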

So in summary, deduplication is messy and complicated, has bad performance, and offers uncertain economics for HPC. Let's not do it.

-L


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf