Very helpful - thank you! Cheers, - Bob
On Sat, Dec 14, 2019 at 11:07 AM Yuval Kogman <[email protected]> wrote: > Hi, > > On Sat, 14 Dec 2019 at 17:13, Bob Glickstein <[email protected]> > wrote: > >> >> - Question 1: Which is right, the comment or the implementation? >> - Question 1a: If the comment is right, then the intention is for 2 >> out of every 1<<13 checksum values to satisfy OnSplit. Why not 1 out >> of every 1<<12? >> >> I will refrain from addressing these, since it's just a parameter, and I > don't know what the original motivations for setting it that way were > > This chunk splitting happens on the second and subsequent chunks of a file >> after the size of the chunk surpasses 64kb, which by my calculation >> <https://play.golang.org/p/cCmTYDAhpo9> happens, on average, within the >> following 5,678 bytes. >> >> - Question 2: Why make irregularly sized chunks at all based on this >> obscure property? Why not split at 64kb boundaries? >> >> A somewhat contrived but helpful example: suppose you have a very large > text file. You store it in perkeep. Then you find a typo in the beginning, > fix it, and store the new version. Although the files are mostly identical, > with a Levenstein distance of 1, a fixed width approach will encode them > into completely disjoint sets of chunks. > > By using subregions of the contents to determine the boundary positions, > large strings with identical contents will be split at different offsets if > the data at those offsets are the same. > > In this example, the 1st chunk will be different for both files (with high > probability) all subsequent chunks will be identical, allowing the data to > be deduplicated. The tradeoff is the complexity of variable sized chunks, > and computing the hashes over these sliding windows, and furthermore this > is convergent independently of the order in which the data was inserted > (unlike say Git which depends on the commit ordering for deduplicating > similar but non identical files when constructing pack files). There are > many variants of this technique, I believe the idea originated as Rabin > fingerprinting, and that rsync was the first prominent application > specifically intended for a distributed setting (don't quote me on that > though, I did not verify). > > A less contrived example might be large documents with embedded resources > (e.g. a slideshow with images), or binary files with editable metadata > (although I believe formats I place such metadata at the end) > > Each chunk created gets a "bits" score which seems to be >> <https://github.com/go4org/go4/blob/132d2879e1e95dadb805c26cd339344efd1a67c8/rollsum/rollsum.go#L74-L77> >> the number of trailing ones in its rolling checksum (though I'm not quite >> sure about that). If this is larger than the bits score of the last 1 or >> more chunks, those are made "children" of this new chunk. >> >> - Question 3: Why? >> >> Having stared at the code for a bit, I'm not sure exactly why this tree > structure is computed the way that it is. > > The rationale for the Bytes schema object being recursive is that if the > chunks of a file/bytes object were encoded as a linear sequence, there > would be no possibility of reusing intermediate nodes. > > If I'm not mistaken, this logic organizes this tree structure based on the > bit score just to retain the same kind of order insensitivity as the > chunking itself, and using the bits score is just a stochastic approach to > get a structure that is roughly balanced, by arbitrarily promoting nodes > with progressively rarer bit scores (but equally some other metric of how > to group the chunks could be used, c.f. prolly trees in noms > <https://github.com/attic-labs/noms/blob/master/doc/intro.md#prolly-trees-probabilistic-b-trees>). > I'm sure someone will correct me if I've missed something. > > -- > You received this message because you are subscribed to the Google Groups > "Perkeep" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To view this discussion on the web visit > https://groups.google.com/d/msgid/perkeep/CAAQdECC5nNF1xCTNt%3DjD5_GYOfrFbSKq2F3eZ1jP9SHbeD0opg%40mail.gmail.com > <https://groups.google.com/d/msgid/perkeep/CAAQdECC5nNF1xCTNt%3DjD5_GYOfrFbSKq2F3eZ1jP9SHbeD0opg%40mail.gmail.com?utm_medium=email&utm_source=footer> > . > -- You received this message because you are subscribed to the Google Groups "Perkeep" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/perkeep/CAEf8c4-8q0R4RG8kK1_UT7JLgYXrfXOyY3bK5nRyGjDfp0FrPQ%40mail.gmail.com.
