mikemccand commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934726
> > I don't think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so, internal arcs before the final arc are in fact expected to output shared prefix bytes?
>
> I thought the 'assumption' here means that we assert the floor data is all stored in the last arc. The whole FST output is encoded as `[ MSBVLong | floordata ]`. We may share prefixes in MSBVLong, but we cannot have two outputs with the same `MSBVLong`, so `floordata` will never be split across more than one arc. Did I misunderstand something?

Sorry @gf2121 -- that is indeed correct: except for the leading
vLong-encoded "fp + 2 bits", the remainder of the floor data will always be on
the last arc. But that leading vLong carries the important flags that we were
losing in the LSB-encoded case.
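
To make the byte-order point concrete, here is a minimal standalone sketch. It is not Lucene's code: the class `VLongOrderSketch`, the helper names, and the exact flag bit layout in `pack` are assumptions made for illustration. It encodes two nearby "fp + 2 bits" values both LSB-first and MSB-first and counts how many leading bytes the encodings share.

```java
import java.io.ByteArrayOutputStream;

public class VLongOrderSketch {

  // Hypothetical packing of "fp + 2 bits": the file pointer shifted left by two,
  // with two flag bits in the low positions (this bit layout is an assumption).
  static long pack(long fp, boolean hasTerms, boolean isFloor) {
    return (fp << 2) | (hasTerms ? 2L : 0L) | (isFloor ? 1L : 0L);
  }

  // LSB-first vLong: lowest 7 bits first; high bit set on every byte but the last.
  // Assumes v >= 0.
  static byte[] lsbVLong(long v) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7FL) | 0x80L));
      v >>>= 7;
    }
    out.write((int) v);
    return out.toByteArray();
  }

  // MSB-first vLong: highest 7-bit group first; same continuation-bit rule.
  // Assumes v >= 0.
  static byte[] msbVLong(long v) {
    int bits = Long.SIZE - Long.numberOfLeadingZeros(v | 1L); // at least one bit
    int n = (bits + 6) / 7;                                   // bytes needed
    byte[] out = new byte[n];
    for (int i = 0; i < n; i++) {
      int group = (int) ((v >>> (7 * (n - 1 - i))) & 0x7FL);
      out[i] = (byte) (i < n - 1 ? group | 0x80 : group);
    }
    return out;
  }

  // Number of leading bytes the two arrays have in common.
  static int commonPrefix(byte[] a, byte[] b) {
    int i = 0;
    while (i < a.length && i < b.length && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  public static void main(String[] args) {
    long a = pack(0x12345L, true, true);
    long b = pack(0x12346L, true, false);
    System.out.println("MSB shared prefix bytes: " + commonPrefix(msbVLong(a), msbVLong(b)));
    System.out.println("LSB shared prefix bytes: " + commonPrefix(lsbVLong(a), lsbVLong(b)));
  }
}
```

For these two inputs the MSB-first encodings share their first two bytes and differ only in the last byte, which is also where the two flag bits land. The LSB-first encodings differ in the very first byte, which is where the flag bits land in that order.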
> As @benwtrent pointed out, we should accumulate from the `outputPrefix` instead of `arc.output`. I raised #12900 for this. This patch seems to fix the exception when searching `WildcardQuery(new Term("body", "*fo*"))` on `Wikibig1m`. I'll try `Wikibigall` as well.

+1 -- this is the right fix (to not lose any leading bytes for the FST's
output in `IntersectTermsEnum`). I'll review the PR and open a follow-on issue
to somehow expose the bug with a stronger BWC test.
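
The difference between accumulating a prefix and reading only the final arc can be sketched with a toy model. This is not Lucene's FST API: `OutputPrefixSketch`, `fullOutput`, and the byte values are hypothetical, and the only thing modelled is that an FST output is the concatenation of the outputs of every arc along the path.

```java
import java.util.Arrays;
import java.util.List;

public class OutputPrefixSketch {

  // Concatenate two byte arrays into a new array.
  static byte[] concat(byte[] a, byte[] b) {
    byte[] out = Arrays.copyOf(a, a.length + b.length);
    System.arraycopy(b, 0, out, a.length, b.length);
    return out;
  }

  // Accumulate the outputs of every arc on the path, in order.
  static byte[] fullOutput(List<byte[]> arcOutputs) {
    byte[] outputPrefix = new byte[0];
    for (byte[] arcOutput : arcOutputs) {
      outputPrefix = concat(outputPrefix, arcOutput);
    }
    return outputPrefix;
  }

  public static void main(String[] args) {
    // Hypothetical path: two internal arcs carry shared leading bytes,
    // the final arc carries the rest of the output.
    List<byte[]> path = Arrays.asList(
        new byte[] {(byte) 0x92},
        new byte[] {(byte) 0x9A},
        new byte[] {0x17, 0x05});

    byte[] accumulated = fullOutput(path);
    byte[] lastArcOnly = path.get(path.size() - 1);

    System.out.println("accumulated length: " + accumulated.length);
    System.out.println("last arc only length: " + lastArcOnly.length);
  }
}
```

In this toy path, reading only the final arc's output drops the two leading bytes that sit on the internal arcs, while the accumulated prefix keeps all four.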