Lawrence Stewart wrote:

[...]

Yup this is it, but on the fly is the hard part. Doing this comparison is computationally very expensive. The hash calculations are not cheap by any measure. You most decidedly do not wish to do this on the fly ...

The assumption of a high performance disk/file system is implicit here.


And for that, it’s all about trading between cost of storage space, retrieval time, and computational effort to run the algorithm.

Exactly.


I think the hash calculations are pretty cheap, actually. I just timed sha1sum on a 2.4 GHz Core 2 and it runs at 148 megabytes per second on one core (from the disk cache). That is substantially faster than the disk transfer rate. If you have a parallel filesystem, you can parallelize the hashes as well.

Disk transfer rates are 100-120 MB/s these days. For high-performance local file systems (N * disks), the data rates you showed for sha1 won't cut it, especially if multiple hash signatures are computed in order to avoid hash collisions (a la MD5 et al.). The probability that two different hash functions both collide on the same pair of distinct blocks is far smaller than the probability that a single hash function collides on some pair of distinct blocks, so you compute two or more hashes in different ways to reduce the chance of a false match.
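
For example, you can at least avoid reading the data twice by feeding one read into two digests. Something like this (bash-specific because of the >() process substitution; the output file names are arbitrary and I haven't benchmarked it here) computes sha1 and md5 in a single pass:

/usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso | tee >(md5sum > iso.md5) | sha1sum > iso.sha1

tee throttles to the slower of the two consumers, so the wall clock is set by the more expensive hash, and each extra digest still burns CPU that could have gone to moving data.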

For laughs, I just tried this on our lab server. We have a 500 MB/s file system attached to it, so we aren't so concerned about data IO bottlenecks.

First run, not in cache:

land...@crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.02user 12.88system 0:47.22elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
8901280inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0  -


Second run, in cache:
land...@crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.01user 8.79system 0:37.47elapsed 23%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0  -


The null version of the hash, just writing to /dev/null:

land...@crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso | cat - > /dev/null
0.00user 8.37system 0:10.96elapsed 76%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps

So ~11 seconds of the elapsed time here is spent on the pipes and internal Unix bits.

Removing the pipe:

land...@crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso > /dev/null
0.00user 4.95system 0:04.94elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps

So the pipe itself costs about 6 seconds when the data is in cache (10.96 s with the pipe vs. 4.94 s without).


And the raw filesystem, bypassing the cache:

land...@crunch:/data/big/isos/centos/centos5$ dd if=CentOS-5.3-x86_64-bin-DVD.iso of=/dev/null iflag=direct bs=32M
135+1 records in
135+1 records out
4557455360 bytes (4.6 GB) copied, 8.72511 s, 522 MB/s

What I am getting at with this is that every added layer costs you time, whether the data is in cache or on disk. Anything that gets in the path of moving data to or from disk (every pipe, every hash/checksum you compute) is going to cost you. The more hashes/checksums you compute, the more time you spend computing and the less time you spend moving data (for a fixed window of time), which lowers your effective bandwidth.
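
Putting rough numbers on the runs above: the in-cache cat | sha1sum pass took 37.47 s and the null pipe took 10.96 s, so sha1sum itself accounts for roughly 26.5 s over 4,557,455,360 bytes, or about 170 MB/s for the hash alone. The raw filesystem delivers ~522 MB/s, so a single sha1 stream already throttles that path to roughly a third of its bandwidth, before you add a second digest or any table lookups.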


Of course, that isn't the only thing you have to worry about. For dedup you also have to do a hash table lookup, and while hash table lookups are pretty efficient, you may need to queue up an insertion when the block isn't found. You will also need lots of RAM to hold these tables. Parallel insertion and parallel lookup are both possible: getting N machines to work on one query should be close to N times faster if you can partition the table N ways. Get M simultaneous queries going, with table inserts mixed in, and ... well, maybe not so fast.
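
As a toy illustration of the lookup/insert side (a rough sketch only, nothing like a real dedup engine; the 128 KB block size and /tmp paths are arbitrary): chop the file into fixed-size blocks, fingerprint each one, and use a directory of hash-named files as the "table", where the existence test is the lookup and the copy is the insert on a miss.

mkdir -p /tmp/blocks /tmp/store
# ~35,000 block files for the 4.6 GB ISO at 128 KB per block
split -b 128K -d -a 6 CentOS-5.3-x86_64-bin-DVD.iso /tmp/blocks/blk_
for b in /tmp/blocks/blk_*; do
    h=$(sha1sum "$b" | awk '{print $1}')
    # lookup: have we already stored a block with this fingerprint?
    [ -e "/tmp/store/$h" ] || cp "$b" "/tmp/store/$h"    # insert on miss
done

Even this serial toy shows where the time goes: one hash per block, one lookup per block, and an insert on every miss. A production system has to do all of that in parallel and keep the table resident in RAM.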

This is part of the reason dedup is used in serial backup pathways, usually with dedicated controller boxes. You can do it in software, but you are going to have to allocate lots of resources to make the hash computation, lookup, and table management fast.

My point was that it's hard to do this on the fly in general. You are right that some aspects can easily be done in parallel, but there are elements that can't be, and the hash computations are still expensive on modern processors.

Haven't tried it on the Nehalem yet, but I don't expect much difference.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics,
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
