Lawrence Stewart wrote:

[...]

Yup this is it, but on the fly is the hard part. Doing this comparison is computationally very expensive. The hash calculations are not cheap by any measure. You most decidedly do not wish to do this on the fly ...

The assumption of a high performance disk/file system is implicit here.


And for that, it’s all about trading between cost of storage space, retrieval time, and computational effort to run the algorithm.

Exactly.


I think the hash calculations are pretty cheap, actually. I just timed sha1sum on a 2.4 GHz Core 2 and it runs at 148 megabytes per second on one core (from the disk cache). That is substantially faster than the disk transfer rate. If you have a parallel filesystem, you can parallelize the hashes as well.

Disk transfer rates are 100-120 MB/s these days. For high-performance local file systems (N * disks), the data rates you showed for sha1 won't cut it, especially if multiple hash signatures are computed in order to avoid hash collisions (a la MD5 et al.). The probability that two different hash functions both collide on the same pair of distinct blocks is far smaller than the probability that a single hash function collides on some pair of distinct blocks, so you compute two or more hashes in different ways to reduce the chance of a false match.
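
For example, you can at least avoid reading the data twice by feeding one read into two digests. Something like this (bash-specific because of the >() process substitution; the output file names are arbitrary and I haven't benchmarked it here) computes sha1 and md5 in a single pass:

/usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso | tee >(md5sum > iso.md5) | sha1sum > iso.sha1

tee throttles to the slower of the two consumers, so the wall clock is set by the more expensive hash, and each extra digest still burns CPU that could have gone to moving data.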

For laughs, I just tried this on our lab server. We have a 500 MB/s file system attached to it, so we aren't so concerned about data IO bottlenecks.

First run, not in cache:

land...@crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.02user 12.88system 0:47.22elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
8901280inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0  -


Second run, in cache:
land...@crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.01user 8.79system 0:37.47elapsed 23%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0  -


The null version of the hash, just writing to /dev/null:

land...@crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso | cat - > /dev/null
0.00user 8.37system 0:10.96elapsed 76%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps

So ~11 seconds of the elapsed time here is spent on the pipes and internal Unix bits.

Removing the pipe:

land...@crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso > /dev/null
0.00user 4.95system 0:04.94elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps

So the pipe itself costs about 6 seconds when the data is in cache (10.96 s with the pipe vs. 4.94 s without).


And the raw filesystem, bypassing the cache:

land...@crunch:/data/big/isos/centos/centos5$ dd if=CentOS-5.3-x86_64-bin-DVD.iso of=/dev/null iflag=direct bs=32M
135+1 records in
135+1 records out
4557455360 bytes (4.6 GB) copied, 8.72511 s, 522 MB/s

What I am getting at with this is that every added layer costs you time, whether the data is in cache or on disk. Anything that gets in the path of moving data to or from disk (every pipe, every hash/checksum you compute) is going to cost you. The more hashes/checksums you compute, the more time you spend computing and the less time you spend moving data (for a fixed window of time), which lowers your effective bandwidth.
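
Putting rough numbers on the runs above: the in-cache cat | sha1sum pass took 37.47 s and the null pipe took 10.96 s, so sha1sum itself accounts for roughly 26.5 s over 4,557,455,360 bytes, or about 170 MB/s for the hash alone. The raw filesystem delivers ~522 MB/s, so a single sha1 stream already throttles that path to roughly a third of its bandwidth, before you add a second digest or any table lookups.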


Of course, that isn't the only thing you have to worry about. For dedup you also have to do a hash table lookup, and while hash table lookups are pretty efficient, you may need to queue up an insertion when the block isn't found. You will also need lots of RAM to hold these tables. Parallel insertion and parallel lookup are both possible: getting N machines to work on one query should be close to N times faster if you can partition the table N ways. Get M simultaneous queries going, with table inserts mixed in, and ... well, maybe not so fast.
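
As a toy illustration of the lookup/insert side (a rough sketch only, nothing like a real dedup engine; the 128 KB block size and /tmp paths are arbitrary): chop the file into fixed-size blocks, fingerprint each one, and use a directory of hash-named files as the "table", where the existence test is the lookup and the copy is the insert on a miss.

mkdir -p /tmp/blocks /tmp/store
# ~35,000 block files for the 4.6 GB ISO at 128 KB per block
split -b 128K -d -a 6 CentOS-5.3-x86_64-bin-DVD.iso /tmp/blocks/blk_
for b in /tmp/blocks/blk_*; do
    h=$(sha1sum "$b" | awk '{print $1}')
    # lookup: have we already stored a block with this fingerprint?
    [ -e "/tmp/store/$h" ] || cp "$b" "/tmp/store/$h"    # insert on miss
done

Even this serial toy shows where the time goes: one hash per block, one lookup per block, and an insert on every miss. A production system has to do all of that in parallel and keep the table resident in RAM.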

This is part of the reason dedup is used in serial backup pathways, usually with dedicated controller boxes. You can do it in software, but you are going to have to allocate lots of resources to make the hash computation, lookup, and table management fast.

My point was that it's hard to do this on the fly in general. You are right that some aspects can easily be done in parallel, but there are elements that can't be, and the hash computations are still expensive on modern processors.

Haven't tried it on the Nehalem yet, but I don't expect much difference.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics,
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
