mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934191
OK I think I understand why we see the bug on 9.8.x indices but NOT 9.9.x. indices. And I'm quite sure blast radius of the bug is contained solely to read-time, i.e. newly written 9.9.0 indices are not corrupt. Though I'd really love to see some stronger testing -- I'm hoping a randomly written large enough index might reveal the bug. I'll open a spinoff issue so we can create that. Details: We write FST with <term-prefix,byte[]> pairs, where term-prefix is the prefix of a set of terms sharing one on-disk block. E.g. `foo` is prefix, and the N terms would be `foobar`, `foobaz`, ... The start of that `byte[]` is a `vLong`, which is the absolute file pointer of the on-disk block shifted up by 2 bits, and 2 lower bits are two important flags: `OUTPUT_FLAG_IS_FLOOR` and `OUTPUT_FLAG_HAS_TERMS`. Now, with 9.8.x, this is LSB encoded: those two flags are in the leading byte, not the last byte. FST will share output byte prefixes when it can, and it does in fact happen (rarely!) that the LSB (6 bits + 2 bit flags) is the same across N term blocks. It can happen if the last 6 bits of their block file pointers happen to be the same. In this case that leading byte (or possibly, even more rarely, bytes) are not all on the last arc, and this PR loses that leading flag-containing byte/bytes in that case, and computes the wrong flag value for `OUTPUT_FLAG_IS_FLOOR`, then tries to readVInt when non was written --> exceptions at read time. @gf2121 indeed the leading fp + 2-bit flags will never be the same across blocks, so this means even if FST is able to share leading prefix byte or two, it will never share that entire `vLong`: there must be at least one final (different) byte for `readVLong` to read. And because `vLong` is self-delineating (each byte knows whether it is the last one by checking 8th bit is clear) losing a byte or two of its leading bytes won't break reading of the `vLong`, i.e., `readVLong` will indeed stop at the right byte (but of course produce the wrong value). Importantly, `IntersectTermsEnum` does not use this encoded file pointer in the `floorData`! It gets the file-pointer by traversing prior arcs in the FST and accumulating N incremental diffs contained in the on-disk block entries leading to that final floor block. So `IntersectTermsEnum` just needs these two big flags. In the 9.9.x written index, where we instead encode that leading long fp + 2 bits in MSB order, we share tons of prefixes of course (this is why FST got so much smaller), but, we will never share the entire thing, and that last byte contains our bit flags. So at read-time we will always have the right (flag containing) byte. The 9.9.x index is intact, and reading it is buggy yet buggy in a harmless way (throwing away bytes it does not try to use), and also because of how `vLong` is self delineating. Fun! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org