Lux, James P wrote:
Isn’t de-dupe just another flavor, conceptually, of a journaling file system, in the sense that in many systems only a small part of the file actually changes each time, so saving “diffs” allows one to reconstruct any arbitrary version with much smaller file space?

It's really more conceptually like RLE (run length encoding) or simple compression, where you start with a pattern and a dictionary, and point out where in the file that pattern repeats.

I guess the de-dupe is a bit more aggressive than that, in that it theoretically can look for common “stuff” between unrelated files, so

It only looks at raw blocks. If two blocks have the same hash signature (think MD5 or SHA ... hopefully something with fewer collisions), then they are treated as duplicates.
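
In rough terms the whole mechanism is something like the toy sketch below. To be clear, this is my own illustration, not any vendor's code: the 4k block size and the file names are placeholders, and a real system keeps its block index on disk, not in an in-memory Python dict.

import hashlib

BLOCK_SIZE = 4096  # common granularities are 512 bytes or 4k

def dedup_file(path, store):
    """Read a file block by block; store only blocks we haven't seen before."""
    recipe = []                              # ordered hashes needed to rebuild the file
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if digest not in store:          # new data: keep one copy
                store[digest] = block
            recipe.append(digest)            # already seen: just reference it
    return recipe

store = {}
recipe_a = dedup_file("file_a.bin", store)   # placeholder file names
recipe_b = dedup_file("file_b.bin", store)
# Any blocks the two files share are stored exactly once in 'store'.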

maybe a better model is a “data compression” algorithm on the fly.

Yup, this is it, but "on the fly" is the hard part. Doing this comparison is computationally very expensive. The hash calculations are not cheap by any measure. You most decidedly do not wish to do this on the fly ...
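
If you want a feel for just the hash cost, a quick-and-dirty measurement like the one below will do. It is purely illustrative; the number you get depends entirely on your CPU, your hash choice, and the implementation.

import hashlib, os, time

data = os.urandom(256 * 1024 * 1024)          # 256 MB of random data
t0 = time.time()
hashlib.sha256(data).hexdigest()              # one pass of the hash, nothing else
elapsed = time.time() - t0
print("SHA-256 over 256 MB: %.2f s (%.0f MB/s)" % (elapsed, 256 / elapsed))

Compare that rate to what your disks or your wire can deliver and you see why doing it inline takes real horsepower.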

And for that, it’s all about trading between cost of storage space, retrieval time, and computational effort to run the algorithm.

Exactly.

(Reliability factors into it a bit. Compression removes redundancy, after all, but the de facto redundancy provided by having previous versions around isn’t a good “system” solution, even if it’s the one people use.)

:)

You get a direct CBA (cost/benefit analysis) comparison between buying the N+1th disk and the time/effort/money to perform this computation. In the end, the former wins.
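
To put rough numbers on it, the back-of-envelope comparison looks something like this. Every figure below is a made-up placeholder, not a quote from anyone's price list; plug in your own.

# Back-of-envelope CBA with hypothetical numbers -- plug in your own.
disk_cost_per_tb = 100.0    # $/TB of raw disk (assumption)
dedup_ratio      = 3.0      # assume dedup shrinks the data 3:1 (assumption)
accelerator_cost = 5000.0   # $ for the dedup box/accelerator (assumption)
data_tb          = 20.0     # size of the data set (assumption)

cost_more_disk = data_tb * disk_cost_per_tb
cost_dedup     = (data_tb / dedup_ratio) * disk_cost_per_tb + accelerator_cost
print("just buy disk: $%.0f    dedup: $%.0f" % (cost_more_disk, cost_dedup))
# With these particular numbers dedup doesn't break even until roughly 75 TB.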

I think one can make the argument that computation is always getting cheaper, at a faster rate than storage density or speed (because of the physics limits on the storage...), so the “span” over which you can do compression can be arbitrarily increased over time. TIFF and fax do compression over a few bits. Zip and its ilk do compression over kilobits or megabits (depending on whether they build a custom symbol table). Dedupe is doing compression over gigabits and terabits, presumably (although I assume that there’s a granularity at some point: a dedupe system looks at symbols that are, say, 512 bytes long, as opposed to ZIP looking at 8-bit symbols, or Group 4 fax looking at 1-bit symbols).

Most dedup implementations work over blocks, and I think most use 512-byte or 4k blocks.

The point is that even if computation is theoretically getting cheaper, calculating the hash (one without collisions, as collision-prone hashes are ... um ... not good for dedup) is still a significant bit of computation.

One well suited for an accelerator, which is why the dedup market seems to be "flooded" with accelerators (which I think are little more than FPGAs implementing some hash computation algorithm).

The hierarchical storage is really optimizing along a different axis than compression. It’s more like cache than compression: make the “average time to get to the next bit you need” smaller rather than “make a smaller number of bits.”

Basically yes ... though HSM is all about driving the cost of the large pool as low as possible. Tape is still used, and lots of people make arguments for tape. But as John pointed out, Spectra Logic is marketing a SATA-eating robot, so I think the days of tape are likely more numbered than before.

A brief anecdote. In 1989, a fellow graduate student was leaving for another school and taking his data with him. He spooled up a tape on a VAX 8650 unit. I asked him why this over other media. His response was: you can read a VAX tape anywhere.

In 2009, twenty years later, I think he might have a different take on this. I put all my bits onto floppies when I left there, and moved the important ones to spinning rust. I can still read the floppies. I doubt he can still read the tapes.

The point is that the tape folks talk about longevity. But that claim makes a number of important assumptions about the media, the drives, and the availability of replacement drives, which, as my advisor in graduate school discovered after her drive died, do not necessarily hold.

Granted, for a lot of systems, “time to get a bit” is proportional to “number of bits.”

Yup.  But that initial latency can be huge.

While the cost of computation is decreasing rapidly, I'll argue that the cost of storage is decreasing as fast, if not faster. This has implications for which mode is preferable ... n-plication onto decreasing-cost media, or computation to minimize the footprint on that already cheap media. The CBA doesn't favor dedup in the long term, though it does favor HSM ... even cloud storage.

The issues there are bandwidth, bandwidth, and, you guessed it, bandwidth.
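
One line of arithmetic makes the point; the data set size and link speed below are arbitrary assumptions, so substitute your own.

data_tb   = 10.0                        # data set size (assumption)
link_mbps = 100.0                       # WAN link to the cloud (assumption)
megabits  = data_tb * 8e6               # 1 TB = 8e12 bits = 8e6 megabits
days      = megabits / link_mbps / 86400
print("%.1f days to push %g TB over a %g Mb/s link" % (days, data_tb, link_mbps))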


On 6/5/09 8:00 AM, "Joe Landman" <land...@scalableinformatics.com> wrote:

    John Hearns wrote:
    >  2009/6/5 Mark Hahn <h...@mcmaster.ca>:
    > > I'm not sure - is there some clear indication that one level of
    > > storage is not good enough?

    I hope I pointed this out before, but Dedup is all about reducing the
    need for the less expensive 'tier'.  Tiered storage has some merits,
    especially in the 'infinite size' storage realm.  Take some things
    offline, leave things you need online until they go dormant.  Define
    dormant on your own terms.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics,
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
