On Jun 5, 2009, at 1:12 PM, Joe Landman wrote:
> Lux, James P wrote:
>> It only looks at raw blocks. If they have the same hash signatures
>> (think like MD5 or SHA ... hopefully with fewer collisions), then
>> they are duplicates.
>>
>> Maybe a better model is a “data compression” algorithm on the fly.
>
> Yup, this is it, but on the fly is the hard part. Doing this
> comparison is computationally very expensive. The hash calculations
> are not cheap by any measure. You most decidedly do not wish to do
> this on the fly ...
>
>> And for that, it’s all about trading between cost of storage space,
>> retrieval time, and computational effort to run the algorithm.
>
> Exactly.
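
For concreteness, here is a quick sketch of that block-hash comparison
(purely illustrative, in Python; the 4 KB block size and the use of
SHA-1 are my assumptions, not anything out of a real dedup product):

import hashlib
import sys
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed fixed block size; real systems vary

def find_duplicate_blocks(path):
    # digest -> list of byte offsets whose blocks hashed to it
    seen = defaultdict(list)
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            seen[hashlib.sha1(block).hexdigest()].append(offset)
            offset += len(block)
    # any digest seen at more than one offset marks candidate duplicates
    return {d: offs for d, offs in seen.items() if len(offs) > 1}

if __name__ == "__main__":
    dupes = find_duplicate_blocks(sys.argv[1])
    print(len(dupes), "duplicated block signatures")

Barring a hash collision, equal digests mean duplicate blocks; a
cautious implementation would still byte-compare them before merging.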
I think the hash calculations are pretty cheap, actually. I just
timed sha1sum on a 2.4 GHz Core 2 and it runs at 148 megabytes per
second on one core (from the disk cache). That is substantially
faster than the disk transfer rate. If you have a parallel
filesystem, you can parallelize the hashes as well.
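
As a rough illustration of what I mean by parallelizing the hashes
(assuming Python and one big local file, rather than a parallel
filesystem): a single SHA-1 stream is inherently serial, but
independent blocks or files can be hashed on separate cores.

import hashlib
import os
import sys
from multiprocessing import Pool

CHUNK = 64 * 1024 * 1024  # 64 MB of hashing work per task; arbitrary choice

def hash_chunk(task):
    path, offset = task
    h = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(offset)
        h.update(f.read(CHUNK))
    return offset, h.hexdigest()

def hash_file_parallel(path, size, workers=4):
    # one task per CHUNK-sized slice of the file
    tasks = [(path, off) for off in range(0, size, CHUNK)]
    with Pool(workers) as pool:
        return dict(pool.map(hash_chunk, tasks))

if __name__ == "__main__":
    path = sys.argv[1]
    digests = hash_file_parallel(path, os.path.getsize(path))
    print(len(digests), "chunks hashed")

At the ~148 MB/s per core I measured, a handful of workers like this
will outrun a single disk, so the hashing is unlikely to be the
bottleneck.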
-L