On Wed, 3 Jun 2009, Joe Landman wrote:
> It might be worth noting that dedup is not intended for high performance file systems ... the cost of computing the hash(es) is(are) huge.
Some file systems do (or claim to do) checksumming for data integrity; that seems to me like the perfect place to also compute a dedup hash - with the data already in cache (it is needed for checksumming anyway), the extra computation should be fast. This would allow detection of duplicates at runtime, but would make detecting duplicates across file systems or for backup more cumbersome, as the hashes would need to be exported somehow from the file system.
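A minimal sketch of that idea, in Python: the block is read once, a cheap integrity checksum and a stronger dedup hash are computed in the same pass, and the hash is looked up in an index to flag duplicates at write time. The block size, the SHA-256 choice, and the in-memory dict standing in for an on-disk index are illustrative assumptions, not any particular file system's implementation.

    import hashlib
    import zlib

    BLOCK_SIZE = 4096      # assumed fixed block size
    block_index = {}       # hash -> block address; stands in for an on-disk index

    def ingest_block(data: bytes, address: int):
        crc = zlib.crc32(data)                  # integrity checksum (cheap)
        digest = hashlib.sha256(data).digest()  # dedup hash, same pass over the data
        duplicate_of = block_index.get(digest)
        if duplicate_of is None:
            block_index[digest] = address       # first copy: remember where it lives
        return crc, duplicate_of                # caller can reference-count the dup

    # Example: two identical blocks, the second is reported as a duplicate.
    blk = b"x" * BLOCK_SIZE
    print(ingest_block(blk, 0))   # (crc, None)
    print(ingest_block(blk, 1))   # (crc, 0)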
One issue that was not mentioned yet is the strength/length of the hash: within one file system, the limits on the number of blocks, files, file sizes, etc. are known, so the hash can be chosen long enough that the probability of a collision is negligible. Across an arbitrarily large number of blocks/files, as can be found on a machine or network with many large devices or file systems, the same guarantee no longer holds.
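To put rough numbers on that, the birthday bound says the collision probability among n blocks hashed to b bits is about n^2 / 2^(b+1). The block counts below are made-up examples, not measurements from any real system:

    # Rough birthday-bound estimate: probability of at least one collision
    # among n_blocks random hash_bits-bit hashes.
    def collision_probability(n_blocks: int, hash_bits: int) -> float:
        return n_blocks * (n_blocks - 1) / 2.0 / 2.0 ** hash_bits

    # One file system: 2^32 blocks (16 TiB at 4 KiB/block), 160-bit hash.
    print(collision_probability(2 ** 32, 160))   # ~6e-30, negligible

    # A whole network of machines: 2^48 blocks.  A 160-bit hash still holds up,
    # but a short 64-bit hash makes collisions practically certain.
    print(collision_probability(2 ** 48, 160))   # ~3e-20
    print(collision_probability(2 ** 48, 64))    # >> 1, i.e. collisions expected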
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de