mikemccand commented on issue #12702: URL: https://github.com/apache/lucene/issues/12702#issuecomment-1772457667
> The floor data is guaranteed to be stored within a single arc (never prefix shared) in the FST because fp is encoded before it.

But won't the leading bytes of `fp` be shared in that prefix (since we switched to MSB vLong encoding)?

> Out of curiosity, I tried completely disabling suffix sharing in the block tree index, which increased total .tip size by only 1.47% for wikimediumall.

This is impressive! I would have expected a worse impact. This is likely because `BlockTree` is essentially storing a prefix-ish trie in RAM, not the full terms dictionary. So the suffixes are mostly dropped from the FST index and left in the term blocks stored separately.

> I wonder if we can avoid adding floor data outputs into NodeHash in some way?

I'm curious: are there any `floorData` outputs in `NodeHash` (shared suffixes) at all today in the BlockTree terms index?

On the ["limit how much RAM FST Compiler is allowed to use to share suffixes" PR](https://github.com/apache/lucene/pull/12633) I also tested fully disabling `NodeHash` (no suffix sharing) when storing all `wikimediumall` index `body` terms in an FST, but found a much bigger increase (65%: 350.2 MB -> 577.4 MB), because in that case the full suffixes are stored in the FST.

Similarly, if we explore [experimental codecs that hold all terms in an FST](https://github.com/apache/lucene/pull/12688), now possible / reasonable since the FST is off-heap, sharing the suffixes will be important.
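To illustrate the question about leading bytes of `fp`, here is a minimal sketch of an MSB-first variable-length long encoding. It is not Lucene's actual implementation: the class `MsbVLongSketch` and its methods `encode` and `commonPrefixLength` are hypothetical names, and the file pointer values are made up. It only shows why two nearby file pointers can end up with identical leading bytes once the most significant 7-bit group is written first.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not the Lucene BlockTree code.
final class MsbVLongSketch {

  // Encode a non-negative long as 7-bit groups, most significant group first.
  // Every byte except the last has its high bit set as a continuation flag.
  public static byte[] encode(long value) {
    if (value < 0) {
      throw new IllegalArgumentException("value must be >= 0");
    }
    int bits = Long.SIZE - Long.numberOfLeadingZeros(value); // 0 when value == 0
    int groups = Math.max(1, (bits + 6) / 7);
    ByteArrayOutputStream out = new ByteArrayOutputStream(groups);
    for (int i = groups - 1; i > 0; i--) {
      out.write((int) ((value >>> (7 * i)) & 0x7F) | 0x80);
    }
    out.write((int) (value & 0x7F));
    return out.toByteArray();
  }

  // Number of leading bytes the two encodings have in common.
  public static int commonPrefixLength(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  public static void main(String[] args) {
    byte[] a = encode(1_000_000L); // made-up file pointer
    byte[] b = encode(1_000_010L); // a nearby made-up file pointer
    System.out.println("shared leading bytes: " + commonPrefixLength(a, b));
  }
}
```

With an LSB-first encoding the same two values would differ in their very first byte, so any sharing of output prefixes along the FST path depends on the MSB-first byte order.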