mikemccand commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848448151

   I've managed to repro (thanks @benwtrent!) and indeed the bug is happening 
because this assumption is (very, very rarely?) invalid:
   
   > This works because we can not have two same fp encoded before floor data. 
I think this assumption works well for now, otherwise we will break this 
[assert](https://github.com/apache/lucene/blob/a9b5ef474958ecee6f305bffe253b53cb58d5591/lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java#L1215)
 before meeting these exceptions.
   
   I see a case where a single-byte prefix `[0xfe]` appears before the final 
arc that has the rest of the floor/frame data.  So the accumulator fails to put 
this `0xfe` first and `readVLong` then fails.  This is on a 9.8.0-written index 
where the `vLong` is still LSB-encoded.
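   
   To illustrate why the byte order matters, here is a minimal standalone 
sketch (not the actual `SegmentTermsEnum` or FST code).  With an LSB-first 
`vLong`, the first byte carries the low 7 bits, so if a byte emitted on an 
earlier arc ends up after the bytes from the final arc, the decoder reads a 
different value.  The class name, the helpers, and the value `1_000_000` are 
all hypothetical; this is not the file pointer from the failing index.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class MisorderedVLongSketch {

  // LSB-first variable-length encoding: 7 payload bits per byte, low bits
  // first, high bit set on every byte except the last. Assumes v >= 0.
  static byte[] writeVLong(long v) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7FL) | 0x80L));
      v >>>= 7;
    }
    out.write((int) v);
    return out.toByteArray();
  }

  // Decodes one LSB-first vLong from the start of bytes; any bytes after the
  // terminating byte are ignored.
  static long readVLong(byte[] bytes) {
    long v = 0;
    int shift = 0;
    for (byte b : bytes) {
      v |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return v;
      }
      shift += 7;
    }
    throw new IllegalStateException("truncated vLong");
  }

  static byte[] concat(byte[] a, byte[] b) {
    byte[] r = Arrays.copyOf(a, a.length + b.length);
    System.arraycopy(b, 0, r, a.length, b.length);
    return r;
  }

  public static void main(String[] args) {
    byte[] encoded = writeVLong(1_000_000L);
    // Pretend the first byte was output on an earlier arc (hypothetical split)
    // and the remaining bytes on the final arc.
    byte[] prefix = {encoded[0]};
    byte[] rest = Arrays.copyOfRange(encoded, 1, encoded.length);
    System.out.println("prefix first: " + readVLong(concat(prefix, rest)));
    System.out.println("prefix last:  " + readVLong(concat(rest, prefix)));
  }
}
```

   In this example the correctly ordered bytes decode back to 1000000, while 
the misordered bytes decode to 7812 without any error.  With other byte 
patterns the decoder could instead run past the end of the data.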
   
   I don't think this assumption is valid @gf2121?  Because that floor data 
first contains the file pointer of the on-disk block that this prefix points to 
(in MSB order as of 9.9, where lots of prefix sharing should happen), so, 
internal arcs before the final arc are in fact expected to output shared prefix 
bytes?
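   
   As a rough illustration of why MSB order leads to prefix sharing, the 
sketch below encodes two nearby values both ways and counts the leading bytes 
they have in common.  The MSB layout here is an assumption for illustration 
(the same 7-bit groups, most significant first) and may not be byte-identical 
to what the 9.9 writer produces; the class name, helper names, and sample 
values are hypothetical.

```java
import java.util.Arrays;

class VLongPrefixSharingSketch {

  // LSB-first: low 7 bits go in the first byte. Assumes v >= 0 (at most 9 bytes).
  static byte[] encodeLsb(long v) {
    byte[] buf = new byte[9];
    int n = 0;
    while ((v & ~0x7FL) != 0) {
      buf[n++] = (byte) ((v & 0x7FL) | 0x80L);
      v >>>= 7;
    }
    buf[n++] = (byte) v;
    return Arrays.copyOf(buf, n);
  }

  // MSB-first (assumed layout): the same 7-bit groups, most significant group
  // first, continuation bit set on every byte except the last. Assumes v >= 0.
  static byte[] encodeMsb(long v) {
    int groups = 1;
    while ((v >>> (7 * groups)) != 0) {
      groups++;
    }
    byte[] out = new byte[groups];
    for (int i = 0; i < groups; i++) {
      long group = (v >>> (7 * (groups - 1 - i))) & 0x7FL;
      boolean last = i == groups - 1;
      out[i] = (byte) (last ? group : (group | 0x80L));
    }
    return out;
  }

  // Number of leading bytes that a and b have in common.
  static int commonPrefix(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  public static void main(String[] args) {
    long fp1 = 1_000_000L; // two nearby example "file pointers"
    long fp2 = 1_000_001L;
    System.out.println("LSB shared bytes: "
        + commonPrefix(encodeLsb(fp1), encodeLsb(fp2)));
    System.out.println("MSB shared bytes: "
        + commonPrefix(encodeMsb(fp1), encodeMsb(fp2)));
  }
}
```

   For these two values the LSB encodings share no leading bytes, while the 
MSB encodings share the first two, which is the kind of shared prefix an FST 
can push onto earlier arcs.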
   
   One thing I am curious about: it's possible that turning off suffix sharing 
(a separate change: #12722) sidesteps this bug, and maybe that's why we are not 
seeing it with newly created (9.9.0) indices?  We could test this by 
backporting #12722 to a 9.8.x SNAPSHOT build, rebuilding `wikibigall`, and 
seeing whether the corruption still happens.  I'm not saying this is a 
workaround or anything, but it'd make me more comfortable if we could 
understand why 9.9.0-written indices are not corrupt.  Alternatively, we could 
revert #12722 in 9.9.x, rebuild `wikibigall`, and see if the bug then happens?  
I likely won't be able to try this for a day or two, so if someone who can 
repro the bug could test this, that would be awesome :). Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

