Hello! I have a few questions about the file schema.

In schema/filewriter.go, files are split into trees of chunks at "interesting
rollsum boundaries"
<https://github.com/perkeep/perkeep/blob/d9e34b748ca155eb606d1026e3f861604eb01442/pkg/schema/filewriter.go#L294>.

The comment describing an "interesting" rollsum boundary says it's when the
trailing 13 bits of the rolling checksum are "set the same way"
<https://github.com/go4org/go4/blob/132d2879e1e95dadb805c26cd339344efd1a67c8/rollsum/rollsum.go#L61-L62>,
which sounds like it means all-zeroes or all-ones. But the implementation
<https://github.com/go4org/go4/blob/132d2879e1e95dadb805c26cd339344efd1a67c8/rollsum/rollsum.go#L64>
says they have to be all ones.

   - Question 1: Which is right, the comment or the implementation?
   - Question 1a: If the comment is right, then the intention is for 2 out
   of every 1<<13 checksum values to satisfy OnSplit. Why not 1 out of
   every 1<<12?
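
To make the arithmetic behind Question 1a concrete, here is a small sketch
that counts how many of the 1<<13 possible low-bit patterns satisfy each
reading. It is not the go4 code: checkBits, allOnes, allSame and count are
names I made up for illustration, and the two predicates are only my
interpretation of the implementation and of the comment.

```go
package main

import "fmt"

// checkBits is a hypothetical name for the number of trailing bits examined.
const checkBits = 13

// allOnes reports whether the low checkBits bits of sum are all 1
// (my reading of what the implementation checks).
func allOnes(sum uint32) bool {
	mask := uint32(1)<<checkBits - 1
	return sum&mask == mask
}

// allSame reports whether the low checkBits bits are all 0 or all 1
// (my reading of what the comment seems to describe).
func allSame(sum uint32) bool {
	mask := uint32(1)<<checkBits - 1
	low := sum & mask
	return low == 0 || low == mask
}

// count returns how many of the 1<<checkBits low-bit patterns satisfy pred.
func count(pred func(uint32) bool) int {
	n := 0
	for v := uint32(0); v < 1<<checkBits; v++ {
		if pred(v) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(count(allOnes), "of", 1<<checkBits) // all-ones reading
	fmt.Println(count(allSame), "of", 1<<checkBits) // all-same reading
}
```

Under the all-ones reading exactly one pattern matches; under the all-same
reading two match, which is the same rate as 1 out of every 1<<12.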

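This chunk splitting happens on the second and subsequent chunks of a file
after the size of the chunk surpasses 64 KB. By my calculation
<https://play.golang.org/p/cCmTYDAhpo9>, the split then happens within the
following 5,678 bytes about half the time (that figure is the median; if
every checksum value is equally likely, the mean gap is 1<<13 = 8,192 bytes).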
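
As a cross-check on that figure, here is a small sketch. It assumes each byte
position independently has a 1 in 1<<13 chance of being a split point, which
is an idealization of how the rolling checksum behaves. Under that assumption
the gap to the next split follows a geometric distribution, so the sketch
prints both its mean and its median. The names p, meanGap and medianGap are
my own.

```go
package main

import (
	"fmt"
	"math"
)

// p is the assumed per-byte chance of hitting a split point: 1 in 1<<13.
const p = 1.0 / (1 << 13)

// meanGap is the expected number of bytes until the first split point
// for a geometric distribution with success probability p.
func meanGap() float64 {
	return 1 / p
}

// medianGap is the smallest n for which the chance of having seen a split
// point within n bytes, 1-(1-p)^n, reaches one half.
func medianGap() int {
	return int(math.Ceil(math.Ln2 / -math.Log1p(-p)))
}

func main() {
	fmt.Printf("mean gap:   %.0f bytes\n", meanGap())
	fmt.Printf("median gap: %d bytes\n", medianGap())
}
```

With these assumptions the mean comes out at 8,192 bytes and the median at
5,678 bytes, so the 5,678 figure corresponds to the point by which half of
all splits have happened.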

   - Question 2: Why make irregularly sized chunks based on this obscure
   property at all? Why not simply split at fixed 64 KB boundaries?

Each chunk created gets a "bits" score, which seems to be
<https://github.com/go4org/go4/blob/132d2879e1e95dadb805c26cd339344efd1a67c8/rollsum/rollsum.go#L74-L77>
the number of trailing ones in its rolling checksum (though I'm not quite
sure about that). If this score is larger than the bits scores of the last
one or more chunks, those chunks are made "children" of this new chunk.
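
Here is how I understand that, written out as a sketch. None of this is
copied from Perkeep or go4: span, appendChunk and trailingOnes are
hypothetical names, and the score function reflects only my (possibly wrong)
reading that the score is the count of trailing ones.

```go
package main

import (
	"fmt"
	"math/bits"
)

// trailingOnes counts the consecutive 1 bits at the low end of sum.
func trailingOnes(sum uint32) int {
	return bits.TrailingZeros32(^sum)
}

// span is a hypothetical stand-in for one chunk plus the chunks nested under it.
type span struct {
	score    int
	children []span
}

// appendChunk adds a chunk with the given score. Any spans at the end of the
// list whose score is strictly lower are removed and become its children.
func appendChunk(spans []span, score int) []span {
	i := len(spans)
	for i > 0 && spans[i-1].score < score {
		i--
	}
	// Copy the tail before reusing the underlying array for the new span.
	children := append([]span(nil), spans[i:]...)
	return append(spans[:i], span{score: score, children: children})
}

func main() {
	fmt.Println(trailingOnes(0x1fff)) // low 13 bits set

	var spans []span
	for _, score := range []int{13, 14, 13, 13, 15} {
		spans = appendChunk(spans, score)
	}
	// The final chunk has the highest score, so it absorbs everything before it.
	fmt.Println(len(spans), len(spans[0].children))
}
```

In this example the chunk scoring 14 first adopts the 13 before it, and the
final chunk scoring 15 then adopts all three top-level spans, leaving a
single root.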

   - Question 3: Why?

Thanks,
- Bob
