mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848448151
I've managed to repro (thanks @benwtrent!) and indeed the bug is happening because this assumption is (very, very rarely?) invalid:

> This works because we can not have two same fp encoded before floor data. I think this assumption works well for now, otherwise we will break this [assert](https://github.com/apache/lucene/blob/a9b5ef474958ecee6f305bffe253b53cb58d5591/lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java#L1215) before meeting these exceptions.

I see a case where a single-byte prefix `[0xfe]` appears before the final arc that has the rest of the floor/frame data. So the accumulator fails to put this `0xfe` first, and `readVLong` then fails. This is on a 9.8.0-written index, where the `vLong` is still LSB encoded.

I don't think this assumption is valid, @gf2121? The floor data begins with the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so internal arcs before the final arc are in fact expected to output shared prefix bytes?

One thing I am curious about: it's possible that turning off suffix sharing (a separate change: #12722) sidesteps this bug, and maybe that's why we are not seeing it with newly created (9.9.0) indices? We could test this by backporting #12722 to a 9.8.x SNAPSHOT build, rebuilding `wikibigall`, and seeing if the corruption still happens. I'm not saying this is a workaround or anything, but it'd make me more comfortable if we could understand why 9.9.0-written indices are not corrupt.

Alternatively, we could revert #12722 in 9.9.x, rebuild `wikibigall`, and see if the bug then happens? I likely won't be able to try this for a day or two, so if someone who can repro the bug could test this, that would be awesome :).

Thanks!
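To make the LSB vs. MSB point concrete, here is a rough, standalone sketch of why byte order matters for prefix sharing. It is not Lucene's actual code: the class `VLongSketch` and its methods are hypothetical names made up for illustration, and the two file pointers are arbitrary. It only assumes the usual variable-length scheme of 7 payload bits per byte with the high bit as a continuation flag.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; none of these names come from Lucene.
class VLongSketch {

    // LSB-first variable-length encoding: the lowest 7 bits are written first.
    static byte[] encodeLsb(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80)); // continuation bit set
            v >>>= 7;
        }
        out.write((int) v); // last byte, continuation bit clear
        return out.toByteArray();
    }

    // MSB-first variable-length encoding: the highest 7-bit group is written first.
    // Assumes v is non-negative.
    static byte[] encodeMsb(long v) {
        int bits = 64 - Long.numberOfLeadingZeros(v | 1); // at least one bit
        int n = (bits + 6) / 7;                           // number of 7-bit groups
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            int shift = 7 * (n - 1 - i);
            int group = (int) ((v >>> shift) & 0x7F);
            out[i] = (byte) (i < n - 1 ? (group | 0x80) : group);
        }
        return out;
    }

    // Number of leading bytes the two arrays have in common.
    static int sharedPrefixLength(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        long fp1 = 1_000_000L; // arbitrary nearby "file pointers"
        long fp2 = 1_000_010L;
        System.out.println("LSB shared prefix bytes: "
            + sharedPrefixLength(encodeLsb(fp1), encodeLsb(fp2)));
        System.out.println("MSB shared prefix bytes: "
            + sharedPrefixLength(encodeMsb(fp1), encodeMsb(fp2)));
    }
}
```

With MSB-first order, nearby file pointers tend to start with the same bytes, so a structure that shares common output prefixes (such as an FST) can push those bytes onto earlier arcs. That matches the expectation above that internal arcs before the final arc may carry shared prefix bytes.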