Hi,

On Sat, 14 Dec 2019 at 17:13, Bob Glickstein <[email protected]>
wrote:

>
>    - Question 1: Which is right, the comment or the implementation?
>    - Question 1a: If the comment is right, then the intention is for 2
>    out of every 1<<13 checksum values to satisfy OnSplit. Why not 1 out
>    of every 1<<12?
>
I will refrain from addressing these, since it's just a parameter, and I
don't know what the original motivations for setting it that way were.

> This chunk splitting happens on the second and subsequent chunks of a file
> after the size of the chunk surpasses 64kb, which by my calculation
> <https://play.golang.org/p/cCmTYDAhpo9> happens, on average, within the
> following 5,678 bytes.
>
>    - Question 2: Why make irregularly sized chunks at all based on this
>    obscure property? Why not split at 64kb boundaries?
>
A somewhat contrived but helpful example: suppose you have a very large
text file. You store it in Perkeep. Then you find a typo near the beginning,
fix it, and store the new version. Although the files are mostly identical,
with a Levenshtein distance of 1, a fixed-width approach can encode them
into completely disjoint sets of chunks: if the fix inserts or deletes a
byte, every subsequent 64kb boundary shifts by that byte.

By using the contents themselves to determine the boundary positions,
identical stretches of data are split at the same relative offsets even
when they start at different absolute positions in the two files, so the
chunk streams resynchronize after the edit.
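To illustrate that resynchronization property, here is a toy sketch; the
split rule (start a new chunk after every space) is a deliberately trivial
stand-in for the rolling checksum, but any rule that looks only at nearby
bytes behaves the same way:

```go
package main

import "fmt"

// chunks splits data using a purely local rule: start a new chunk
// after every space. This is a stand-in for a rolling-checksum rule;
// what matters is that the decision depends only on nearby bytes.
func chunks(data []byte) []string {
	var out []string
	start := 0
	for i := 1; i < len(data); i++ {
		if data[i-1] == ' ' {
			out = append(out, string(data[start:i]))
			start = i
		}
	}
	return append(out, string(data[start:]))
}

func main() {
	a := "the quick brown fox jumps over the lazy dog"
	b := "X" + a // a one-byte edit at the front
	fmt.Printf("%q\n%q\n", chunks([]byte(a)), chunks([]byte(b)))
	// Only the first chunk differs ("the " vs "Xthe "); every later
	// chunk is byte-identical, so a store that deduplicates by chunk
	// hash reuses all of them. Fixed-width chunks of b would instead
	// all be one-byte-shifted copies, sharing nothing with a's chunks.
}
```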

In this example, the 1st chunk will be different for the two files (with
high probability), but all subsequent chunks will be identical, allowing
the data to be deduplicated. The tradeoff is the complexity of
variable-sized chunks and of computing the hashes over these sliding
windows. Furthermore, this is convergent independently of the order in
which the data was inserted (unlike, say, Git, which depends on the commit
ordering for deduplicating similar but non-identical files when
constructing pack files). There are many variants of this technique; I
believe the idea originated as Rabin fingerprinting, and that rsync was the
first prominent application specifically intended for a distributed setting
(don't quote me on that though, I did not verify).

A less contrived example might be large documents with embedded resources
(e.g. a slideshow with images), or binary files with editable metadata
(although I believe most such formats place the metadata at the end).

Each chunk created gets a "bits" score which seems to be
> <https://github.com/go4org/go4/blob/132d2879e1e95dadb805c26cd339344efd1a67c8/rollsum/rollsum.go#L74-L77>
> the number of trailing ones in its rolling checksum (though I'm not quite
> sure about that). If this is larger than the bits score of the last 1 or
> more chunks, those are made "children" of this new chunk.
>
>    - Question 3: Why?
>
Having stared at the code for a bit, I'm not sure exactly why this tree
structure is computed the way that it is.
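For what it's worth, my reading of the linked lines matches the quoted
description: the score is the number of trailing one bits in the checksum,
which is at least 13 for any chunk, since OnSplit already required the low
13 bits to be ones. A paraphrase of that counting (my sketch, not the
actual rollsum.go code):

```go
package main

import "fmt"

// trailingOnes counts consecutive low-order 1 bits. Under OnSplit's
// condition every chunk's checksum ends in at least 13 ones, and each
// extra trailing one beyond that is half as likely as the previous,
// so higher "bits" scores are exponentially rarer.
func trailingOnes(v uint32) int {
	n := 0
	for v&1 == 1 {
		n++
		v >>= 1
	}
	return n
}

func main() {
	for _, v := range []uint32{0x7, 0xb, 0x1FFF, 0x3FFF} {
		fmt.Printf("%#x -> %d trailing ones\n", v, trailingOnes(v))
	}
	// 0x7 -> 3, 0xb -> 2, 0x1fff -> 13, 0x3fff -> 14
}
```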

The rationale for the Bytes schema object being recursive is that if the
chunks of a file/bytes object were encoded as a linear sequence, there
would be no possibility of reusing intermediate nodes.

If I'm not mistaken, this logic organizes the tree structure based on the
bits score just to retain the same kind of order insensitivity as the
chunking itself. Using the bits score is a stochastic way to get a
structure that is roughly balanced, by promoting nodes with progressively
rarer bit scores (though equally, some other metric for grouping the chunks
could be used, cf. prolly trees in noms
<https://github.com/attic-labs/noms/blob/master/doc/intro.md#prolly-trees-probabilistic-b-trees>).
I'm sure someone will correct me if I've missed something.
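To make the promotion rule concrete, here is a guess at its shape
(hypothetical types and names, not a transcription of Perkeep's
filewriter): each new chunk adopts, as children, the immediately preceding
chunks whose bits score is strictly lower. Because the scores are derived
from chunk contents alone, the same sequence of chunks always yields the
same tree:

```go
package main

import "fmt"

// node stands in for a schema blob: a chunk plus any lower-scoring
// chunks folded under it. (Hypothetical; for illustration only.)
type node struct {
	bits     int
	children []node
}

// buildTree applies the promotion rule: each new chunk pops the
// preceding chunks with strictly lower bits scores off the stack and
// makes them its children; whatever remains are the top-level nodes.
func buildTree(scores []int) []node {
	var stack []node
	for _, b := range scores {
		n := node{bits: b}
		for len(stack) > 0 && stack[len(stack)-1].bits < b {
			n.children = append([]node{stack[len(stack)-1]}, n.children...)
			stack = stack[:len(stack)-1]
		}
		stack = append(stack, n)
	}
	return stack
}

func main() {
	roots := buildTree([]int{13, 14, 13, 13, 16, 13})
	fmt.Println("top-level nodes:", len(roots))
	// -> 2: the 16-score node (which absorbed 14, 13, 13, and,
	// via the 14, the very first 13) and the trailing 13.
}
```

Whichever way the real code structures this, the stochastic balancing
argument is the same: a score of b+1 is half as likely as b, so each level
of the tree thins out geometrically.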

-- 
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.