On Wed, 3 Jun 2009, Joe Landman wrote:
> It might be worth noting that dedup is not intended for high performance file systems ... the cost of computing the hash(es) is(are) huge.
Some file systems do (or claim to do) checksumming for data integrity; that seems to me like the perfect place to also compute a dedup hash - with the data already in cache (it is needed for checksumming anyway), the extra computation should be fast. This would allow detection of duplicates at runtime, but would make detecting duplicates across file systems or for backup more cumbersome, as the hashes would need to be exported somehow from the file system.
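A minimal sketch of that idea, in Python: the block is read once, a cheap integrity checksum and a stronger dedup hash are computed in the same pass, and the hash is looked up in an index to flag duplicates at write time. The block size, the SHA-256 choice, and the in-memory dict standing in for an on-disk index are illustrative assumptions, not any particular file system's implementation.

    import hashlib
    import zlib

    BLOCK_SIZE = 4096      # assumed fixed block size
    block_index = {}       # hash -> block address; stands in for an on-disk index

    def ingest_block(data: bytes, address: int):
        crc = zlib.crc32(data)                  # integrity checksum (cheap)
        digest = hashlib.sha256(data).digest()  # dedup hash, same pass over the data
        duplicate_of = block_index.get(digest)
        if duplicate_of is None:
            block_index[digest] = address       # first copy: remember where it lives
        return crc, duplicate_of                # caller can reference-count the dup

    # Example: two identical blocks, the second is reported as a duplicate.
    blk = b"x" * BLOCK_SIZE
    print(ingest_block(blk, 0))   # (crc, None)
    print(ingest_block(blk, 1))   # (crc, 0)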
One issue that was not mentioned yet is the strength/length of the hash: within one file system, the limits on the number of blocks, files, file sizes, etc. are known, so the hash can be chosen long enough that the probability of a collision is negligible. Across an arbitrarily large number of blocks/files, as can be found on a machine or network with many large devices or file systems, the same guarantee no longer holds.
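To put rough numbers on that, the birthday bound says the collision probability among n blocks hashed to b bits is about n^2 / 2^(b+1). The block counts below are made-up examples, not measurements from any real system:

    # Rough birthday-bound estimate: probability of at least one collision
    # among n_blocks random hash_bits-bit hashes.
    def collision_probability(n_blocks: int, hash_bits: int) -> float:
        return n_blocks * (n_blocks - 1) / 2.0 / 2.0 ** hash_bits

    # One file system: 2^32 blocks (16 TiB at 4 KiB/block), 160-bit hash.
    print(collision_probability(2 ** 32, 160))   # ~6e-30, negligible

    # A whole network of machines: 2^48 blocks.  A 160-bit hash still holds up,
    # but a short 64-bit hash makes collisions practically certain.
    print(collision_probability(2 ** 48, 160))   # ~3e-20
    print(collision_probability(2 ** 48, 64))    # >> 1, i.e. collisions expected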
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de