mikemccand commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934191

   OK I think I understand why we see the bug on 9.8.x indices but NOT 9.9.x. 
indices.  And I'm quite sure blast radius of the bug is contained solely to 
read-time, i.e. newly written 9.9.0 indices are not corrupt.  Though I'd really 
love to see some stronger testing -- I'm hoping a randomly written large enough 
index might reveal the bug.  I'll open a spinoff issue so we can create that.
   
   Details:
   
   We write FST with <term-prefix,byte[]> pairs, where term-prefix is the 
prefix of a set of terms sharing one on-disk block.  E.g. `foo` is prefix, and 
the N terms would be `foobar`, `foobaz`, ...  The start of that `byte[]` is a 
`vLong`, which is the absolute file pointer of the on-disk block shifted up by 
2 bits, and 2 lower bits are two important flags: `OUTPUT_FLAG_IS_FLOOR` and 
`OUTPUT_FLAG_HAS_TERMS`.
   
   Now, with 9.8.x, this is LSB encoded: those two flags are in the leading 
byte, not the last byte.  FST will share output byte prefixes when it can, and 
it does in fact happen (rarely!) that the LSB (6 bits + 2 bit flags) is the 
same across N term blocks.  It can happen if the last 6 bits of their block 
file pointers happen to be the same.  In this case that leading byte (or 
possibly, even more rarely, bytes) are not all on the last arc, and this PR 
loses that leading flag-containing byte/bytes in that case, and computes the 
wrong flag value for `OUTPUT_FLAG_IS_FLOOR`, then tries to readVInt when non 
was written --> exceptions at read time.
   
   @gf2121 indeed the leading fp + 2-bit flags will never be the same across 
blocks, so this means even if FST is able to share leading prefix byte or two, 
it will never share that entire `vLong`: there must be at least one final 
(different) byte for `readVLong` to read.  And because `vLong` is 
self-delineating (each byte knows whether it is the last one by checking 8th 
bit is clear) losing a byte or two of its leading bytes won't break reading of 
the `vLong`, i.e., `readVLong` will indeed stop at the right byte (but of 
course produce the wrong value).
   
   Importantly, `IntersectTermsEnum` does not use this encoded file pointer in 
the `floorData`!  It gets the file-pointer by traversing prior arcs in the FST 
and accumulating N incremental diffs contained in the on-disk block entries 
leading to that final floor block.  So `IntersectTermsEnum` just needs these 
two big flags.
   
   In the 9.9.x written index, where we instead encode that leading long fp + 2 
bits in MSB order, we share tons of prefixes of course (this is why FST got so 
much smaller), but, we will never share the entire thing, and that last byte 
contains our bit flags.  So at read-time we will always have the right (flag 
containing) byte.  The 9.9.x index is intact, and reading it is buggy yet buggy 
in a harmless way (throwing away bytes it does not try to use), and also 
because of how `vLong` is self delineating.  Fun!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to