mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934726
> > I don't think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so, internal arcs before the final arc are in fact expected to output shared prefix bytes?
>
> I thought the 'assumption' here means that we assert the floor data are all stored in the last arc. The whole FST output is encoded as `[ MSBVLong | floordata ]`. We may share prefixes in `MSBVLong`, but we cannot have two outputs with the same `MSBVLong`, so `floordata` will never be split across more than one arc. Did I misunderstand something?

Sorry @gf2121 -- that is indeed correct: except for the leading vLong-encoded "fp + 2 bits", the remainder of the floor data will always be on the last arc. But that leading vLong has those important flags that we were losing in the LSB-encoded case.

> As @benwtrent pointed out, we should accumulate from the `outputPrefix` instead of `arc.output`. I raised #12900 for this. This patch seems to fix the exception when searching `WildcardQuery(new Term("body", "*fo*"))` on `Wikibig1m`. I'll try `Wikibigall` as well.

+1 -- this is the right fix (to not lose any leading bytes of the FST's output in `IntersectTermsEnum`). I'll review the PR and open a follow-on issue to somehow expose the bug with a stronger BWC test.
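
To make the prefix-sharing point concrete, here is a minimal Java sketch of why decoding must start from the accumulated output prefix rather than from the final arc's output alone. It is not Lucene's code: the class `FloorOutputSketch`, its helpers, and the `(fp << 2) | flags` layout are assumptions made for illustration, and the FST is simulated by simply splitting one encoded byte array into a shared prefix and a remainder.

```java
import java.util.Arrays;

public class FloorOutputSketch {

  // Hypothetical helper: MSB-first vLong, 7 payload bits per byte,
  // continuation bit (0x80) set on every byte except the last.
  static byte[] encodeMsbVLong(long v) {
    int n = Math.max(1, (64 - Long.numberOfLeadingZeros(v) + 6) / 7);
    byte[] out = new byte[n];
    for (int i = 0; i < n; i++) {
      int shift = 7 * (n - 1 - i);
      int b = (int) ((v >>> shift) & 0x7F);
      if (i < n - 1) {
        b |= 0x80;
      }
      out[i] = (byte) b;
    }
    return out;
  }

  // Reads one MSB-first vLong from the start of bytes; trailing bytes are ignored.
  static long decodeMsbVLong(byte[] bytes) {
    long v = 0;
    for (byte b : bytes) {
      v = (v << 7) | (b & 0x7F);
      if ((b & 0x80) == 0) {
        break;
      }
    }
    return v;
  }

  // Number of leading bytes the two encodings have in common.
  static int sharedPrefixLength(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  static byte[] concat(byte[] a, byte[] b) {
    byte[] out = Arrays.copyOf(a, a.length + b.length);
    System.arraycopy(b, 0, out, a.length, b.length);
    return out;
  }

  public static void main(String[] args) {
    // Assumed layout: file pointer shifted left by 2, two flag bits in the low bits.
    long codeA = (1_000_000L << 2) | 0b11;
    long codeB = (1_000_001L << 2) | 0b10;
    byte[] a = encodeMsbVLong(codeA);
    byte[] b = encodeMsbVLong(codeB);

    // Nearby file pointers share their leading bytes in MSB order.
    int shared = sharedPrefixLength(a, b);
    byte[] outputPrefix = Arrays.copyOfRange(a, 0, shared);    // as if carried by earlier arcs
    byte[] lastArcOutput = Arrays.copyOfRange(a, shared, a.length); // as if carried by the final arc

    long accumulated = decodeMsbVLong(concat(outputPrefix, lastArcOutput));
    long lastArcOnly = decodeMsbVLong(lastArcOutput);

    System.out.println("shared prefix bytes: " + shared);      // 3
    System.out.println("from outputPrefix:   " + accumulated); // 4000003
    System.out.println("from last arc only:  " + lastArcOnly); // 3
  }
}
```

In this toy setup the two encodings share three of their four bytes, so a reader that looks only at the final arc's byte decodes a different value than the one that was written. Accumulating from the prefix restores the full encoded value.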