Re: File schema

Bob Glickstein Sun, 15 Dec 2019 17:11:59 -0800

Very helpful - thank you!

Cheers,
- Bob


On Sat, Dec 14, 2019 at 11:07 AM Yuval Kogman <[email protected]>
wrote:

> Hi,
>
> On Sat, 14 Dec 2019 at 17:13, Bob Glickstein <[email protected]>
> wrote:
>
>>
>>    - Question 1: Which is right, the comment or the implementation?
>>    - Question 1a: If the comment is right, then the intention is for 2
>>    out of every 1<<13 checksum values to satisfy OnSplit. Why not 1 out
>>    of every 1<<12?
>>
>> I will refrain from addressing these, since it's just a parameter, and I
> don't know what the original motivations for setting it that way were
>
> This chunk splitting happens on the second and subsequent chunks of a file
>> after the size of the chunk surpasses 64kb, which by my calculation
>> <https://play.golang.org/p/cCmTYDAhpo9> happens, on average, within the
>> following 5,678 bytes.
>>
>>    - Question 2: Why make irregularly sized chunks at all based on this
>>    obscure property? Why not split at 64kb boundaries?
>>
>> A somewhat contrived but helpful example: suppose you have a very large
> text file. You store it in perkeep. Then you find a typo in the beginning,
> fix it, and store the new version. Although the files are mostly identical,
> with a Levenstein distance of 1, a fixed width approach will encode them
> into completely disjoint sets of chunks.
>
> By using subregions of the contents to determine the boundary positions,
> large strings with identical contents will be split at different offsets if
> the data at those offsets are the same.
>
> In this example, the 1st chunk will be different for both files (with high
> probability) all subsequent chunks will be identical, allowing the data to
> be deduplicated. The tradeoff is the complexity of variable sized chunks,
> and computing the hashes over these sliding windows, and furthermore this
> is convergent independently of the order in which the data was inserted
> (unlike say Git which depends on the commit ordering for deduplicating
> similar but non identical files when constructing pack files). There are
> many variants of this technique, I believe the idea originated as Rabin
> fingerprinting, and that rsync was the first prominent application
> specifically intended for a distributed setting (don't quote me on that
> though, I did not verify).
>
> A less contrived example might be large documents with embedded resources
> (e.g. a slideshow with images), or binary files with editable metadata
> (although I believe formats I place such metadata at the end)
>
> Each chunk created gets a "bits" score which seems to be
>> <https://github.com/go4org/go4/blob/132d2879e1e95dadb805c26cd339344efd1a67c8/rollsum/rollsum.go#L74-L77>
>> the number of trailing ones in its rolling checksum (though I'm not quite
>> sure about that). If this is larger than the bits score of the last 1 or
>> more chunks, those are made "children" of this new chunk.
>>
>>    - Question 3: Why?
>>
>> Having stared at the code for a bit, I'm not sure exactly why this tree
> structure is computed the way that it is.
>
> The rationale for the Bytes schema object being recursive is that if the
> chunks of a file/bytes object were encoded as a linear sequence, there
> would be no possibility of reusing intermediate nodes.
>
> If I'm not mistaken, this logic organizes this tree structure based on the
> bit score just to retain the same kind of order insensitivity as the
> chunking itself, and using the bits score is just a stochastic approach to
> get a structure that is roughly balanced, by arbitrarily promoting nodes
> with progressively rarer bit scores (but equally some other metric of how
> to group the chunks could be used, c.f. prolly trees in noms
> <https://github.com/attic-labs/noms/blob/master/doc/intro.md#prolly-trees-probabilistic-b-trees>).
> I'm sure someone will correct me if I've missed something.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Perkeep" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/perkeep/CAAQdECC5nNF1xCTNt%3DjD5_GYOfrFbSKq2F3eZ1jP9SHbeD0opg%40mail.gmail.com
> <https://groups.google.com/d/msgid/perkeep/CAAQdECC5nNF1xCTNt%3DjD5_GYOfrFbSKq2F3eZ1jP9SHbeD0opg%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/perkeep/CAEf8c4-8q0R4RG8kK1_UT7JLgYXrfXOyY3bK5nRyGjDfp0FrPQ%40mail.gmail.com.

Re: File schema

Reply via email to