On Jun 5, 2009, at 1:12 PM, Joe Landman wrote:
> Lux, James P wrote:
>> It only looks at raw blocks. If they have the same hash signatures
>> (think like MD5 or SHA ... hopefully with fewer collisions), then
>> they are duplicates.
>>
>> Maybe a better model is a “data compression” algorithm on the fly.
>
> Yup, this is it, but on the fly is the hard part. Doing this
> comparison is computationally very expensive. The hash calculations
> are not cheap by any measure. You most decidedly do not wish to do
> this on the fly ...
>
>> And for that, it’s all about trading between cost of storage space,
>> retrieval time, and computational effort to run the algorithm.
>
> Exactly.
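
For concreteness, here is a quick sketch of that block-hash comparison
(purely illustrative, in Python; the 4 KB block size and the use of
SHA-1 are my assumptions, not anything out of a real dedup product):

import hashlib
import sys
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed fixed block size; real systems vary

def find_duplicate_blocks(path):
    # digest -> list of byte offsets whose blocks hashed to it
    seen = defaultdict(list)
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            seen[hashlib.sha1(block).hexdigest()].append(offset)
            offset += len(block)
    # any digest seen at more than one offset marks candidate duplicates
    return {d: offs for d, offs in seen.items() if len(offs) > 1}

if __name__ == "__main__":
    dupes = find_duplicate_blocks(sys.argv[1])
    print(len(dupes), "duplicated block signatures")

Barring a hash collision, equal digests mean duplicate blocks; a
cautious implementation would still byte-compare them before merging.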
I think the hash calculations are pretty cheap, actually. I just
timed sha1sum on a 2.4 GHz Core 2 and it runs at 148 megabytes per
second on one core (from the disk cache). That is substantially
faster than the disk transfer rate. If you have a parallel
filesystem, you can parallelize the hashes as well.
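
As a rough illustration of what I mean by parallelizing the hashes
(assuming Python and one big local file, rather than a parallel
filesystem): a single SHA-1 stream is inherently serial, but
independent blocks or files can be hashed on separate cores.

import hashlib
import os
import sys
from multiprocessing import Pool

CHUNK = 64 * 1024 * 1024  # 64 MB of hashing work per task; arbitrary choice

def hash_chunk(task):
    path, offset = task
    h = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(offset)
        h.update(f.read(CHUNK))
    return offset, h.hexdigest()

def hash_file_parallel(path, size, workers=4):
    # one task per CHUNK-sized slice of the file
    tasks = [(path, off) for off in range(0, size, CHUNK)]
    with Pool(workers) as pool:
        return dict(pool.map(hash_chunk, tasks))

if __name__ == "__main__":
    path = sys.argv[1]
    digests = hash_file_parallel(path, os.path.getsize(path))
    print(len(digests), "chunks hashed")

At the ~148 MB/s per core I measured, a handful of workers like this
will outrun a single disk, so the hashing is unlikely to be the
bottleneck.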
-L