On Jun 5, 2009, at 1:12 PM, Joe Landman wrote:

Lux, James P wrote:

It only looks at raw blocks. If they have the same hash signatures (think like MD5 or SHA ... hopefully with fewer collisions), then they are duplicates.

Maybe a better model is a “data compression” algorithm on the fly.

Yup, this is it, but doing it on the fly is the hard part. The comparison is computationally very expensive, and the hash calculations are not cheap by any measure. You most decidedly do not wish to do this on the fly ...
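
To make the block-hash idea concrete, here is a minimal sketch (illustrative only; the 4 KB block size and the choice of SHA-1 are assumptions, not any particular product's design):

    # Sketch: block-level duplicate detection by hashing fixed-size blocks.
    # A real deduplicator would byte-compare blocks whose digests match
    # before trusting the hash, to guard against collisions.
    import hashlib

    BLOCK_SIZE = 4096  # assumed block size

    def find_duplicate_blocks(path):
        seen = {}        # digest -> offset of first occurrence
        duplicates = []  # (duplicate offset, original offset)
        offset = 0
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha1(block).digest()
                if digest in seen:
                    duplicates.append((offset, seen[digest]))
                else:
                    seen[digest] = offset
                offset += len(block)
        return duplicates

Even a loop this simple has to hash every byte it reads, which is where the cost argument comes from.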

And for that, it’s all about the trade-off between storage cost, retrieval time, and the computational effort to run the algorithm.

Exactly.


I think the hash calculations are pretty cheap, actually. I just timed sha1sum on a 2.4 GHz Core 2 and it runs at 148 megabytes per second on one core (from the disk cache). That is substantially faster than the disk transfer rate. If you have a parallel filesystem, you can parallelize the hashes as well.
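
As a rough sketch of that parallelization (one SHA-1 per file, fanned out over a process pool; the file names below are placeholder assumptions):

    # Sketch: hash many files in parallel, one sha1 per worker process.
    import hashlib
    from multiprocessing import Pool

    def sha1_of_file(path, chunk_size=1 << 20):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return path, h.hexdigest()

    if __name__ == "__main__":
        files = ["file1.bin", "file2.bin", "file3.bin"]  # placeholder paths
        with Pool() as pool:
            for path, digest in pool.map(sha1_of_file, files):
                print(digest, path)

Pool() defaults to one worker per core, so hash throughput scales roughly with core count until the filesystem becomes the bottleneck.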

-L


