The question is, Joe,

Why are you storing it uncompressed?

Vincent

On Oct 3, 2008, at 5:45 PM, Joe Landman wrote:

Carsten Aulbert wrote:

If 7-zip can only compress data at a rate of less than, say, 5 MB/s (input data), I can copy the data over uncompressed much faster, regardless of how many unused cores I have in the system. Exactly for these cases I would like to use all available cores to compress the data quickly, in order
to increase the throughput.

This is fundamentally the issue. If the compression time plus the transmit time for the compressed data is greater than the transmit time for the uncompressed data, then the compression may not be worth it. Sure, if it is nothing but text files, you may get 60-80+% compression ratios. But for the case of (non-pathological) binary data, it might be only a few percent. So in this case, even if you could get a few percent delta from the compression, is that worth all the extra time you spend to get it?
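
Put as a back-of-the-envelope inequality (my shorthand, not a precise model), compression only pays off when

        t_compress + S_compressed/B_net  <  S_uncompressed/B_net

where B_net is the effective link bandwidth; if the compressor cannot keep ahead of the wire, the left-hand side loses.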

At the end of the day, the question is: how much lossless compression can you do in a short enough time for it to be meaningful in terms of transmitting the data?

Do I miss something vital?

Nope.  You got it nailed.

Several months ago, I tried moving about 600 GB of data from an old server to a JackRabbit. The old server and the JackRabbit had a gigabit link between them. We regularly saw 45 MB/s scp rates (one of the chips in the older server was a Broadcom).

I tried this with and without compression. With compression (simple gzip), the copy took something like 28 hours (a little more than a day). Without compression, it was well under 10 hours.

In this case, compression (gzip) was not worth it. The commands I used for the test were:

uncompressed:

        cd /directory
        tar -cpf - ./ | ssh jackrabbit "cd /directory ; tar -xpvf - "

compressed:

        cd /directory
        tar -czpf - ./ | ssh jackrabbit "cd /directory ; tar -xzpvf - "

If you want to spend even more time, use "j" (bzip2) rather than "z" (gzip) in the options.

YMMV, but I have been convinced that, apart from specific use cases with text-only documents or documents known to compress quickly and well, compression prior to transfer may waste more time than it saves.

This said, if someone has a parallel hack of gzip or similar we can pipe through, by all means, I would be happy to try it. But it would have to be pretty darned efficient.
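
For what it's worth, here is a sketch of what such a pipeline might look like with pigz (a parallel gzip work-alike) in place of the "z" option. This is untested in the setup above and assumes pigz is installed on both ends:

        cd /directory
        tar -cpf - ./ | pigz -p 8 | ssh jackrabbit "cd /directory ; pigz -d | tar -xpf - "

The "-p 8" is just a placeholder for however many cores you are willing to burn on the sending side.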

100 MB/s means 1 byte transmitted, on average, every 10 nanoseconds, which means that for compression to be meaningful, you would need to compute for less time than that per byte to increase the information density. Put another way, 1 MB takes about 10 ms to send over a gigabit link. For compression to be meaningful, you need to compress this 1 MB in far less than 10 ms and still transmit it within that window. Assuming that any compression algorithm has to walk through the data at least once, a 1 GB/s memory subsystem takes about 1 ms for that single pass, so you need as few passes as possible through the data set to construct the compressed representation, as you will still have on the order of 1E+5 bytes to send.
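
Spelling out that arithmetic with the same round numbers:

        1 gigabit link  ~ 1E+8 bytes/s effective  ->  1 byte every 10 ns
        1 MB / 1E+8 B/s = 10 ms  to transmit uncompressed
        1 MB / 1E+9 B/s =  1 ms  for a single pass over that 1 MB in memory

so a single pass through memory already eats about a tenth of the transmit window before any real compression work is done.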

I am not saying it is hopeless, just hard for complex compression schemes (bzip2, etc.). When we get enough firepower in the CPU (or maybe GPU ... hmmmm) the situation may improve.

GPU as a compression engine?  Interesting ...

Joe

Cheers
Carsten

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
