Re: [I] Optimize FST suffix sharing for block tree index [lucene]

via GitHub Fri, 20 Oct 2023 03:14:18 -0700


mikemccand commented on issue #12702:
URL: https://github.com/apache/lucene/issues/12702#issuecomment-1772457667


   > The floor data is guaranteed to be stored within single arc (never be 
prefix shared) in FST because fp is encoded before it.
   
   But won't the leading bytes of `fp` be shared in that prefix (since we 
switched to MSB vLong encoding)?
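
To make the question concrete, here is a minimal sketch of why byte order matters for prefix sharing. It is not Lucene's actual encoder: the class `MsbVLongSketch` and its methods are hypothetical names, and the only assumptions are that `fp` is a non-negative long and that "MSB vLong" means 7-bit groups written most significant first. Two nearby file pointers then agree on their leading bytes, whereas the usual low-bits-first vLong differs in the very first byte.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not the encoder used by BlockTree.
final class MsbVLongSketch {

    // 7-bit groups, most significant group first.
    // Every byte except the last carries a continuation flag (0x80).
    static byte[] encodeMsbVLong(long value) {
        if (value < 0) throw new IllegalArgumentException("value must be non-negative");
        int groups = 1;
        for (long v = value >>> 7; v != 0; v >>>= 7) groups++;
        byte[] out = new byte[groups];
        for (int i = groups - 1; i >= 0; i--) {
            int bits = (int) (value & 0x7F);
            out[i] = (byte) (i == groups - 1 ? bits : bits | 0x80);
            value >>>= 7;
        }
        return out;
    }

    // Conventional vLong: low 7 bits first, continuation flag on all but the last byte.
    static byte[] encodeLsbVLong(long value) {
        if (value < 0) throw new IllegalArgumentException("value must be non-negative");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Number of leading bytes the two arrays have in common.
    static int commonPrefixLength(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
    }

    public static void main(String[] args) {
        long fp1 = 1_000_000L; // two made-up, nearby file pointers
        long fp2 = 1_000_010L;
        // MSB-first: BD 84 40 vs BD 84 4A, so 2 leading bytes match
        System.out.println(commonPrefixLength(encodeMsbVLong(fp1), encodeMsbVLong(fp2)));
        // LSB-first: C0 84 3D vs CA 84 3D, so 0 leading bytes match
        System.out.println(commonPrefixLength(encodeLsbVLong(fp1), encodeLsbVLong(fp2)));
    }
}
```

If the FST pushes the common prefix of byte-sequence outputs toward the root, the matching leading bytes in the MSB-first case are exactly what could end up on a shared arc, which is the concern raised above.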
   
   > Out of curiosity, i tried to completely disable suffix sharing in block 
tree index, result in only 1.47% total .tip size increased for wikimediumall.
   
   This is impressive!  I would have expected a worse impact.
   
   This is likely because `BlockTree` is essentially storing a prefix-ish trie 
in RAM, not the full terms dictionary.  So the suffixes are mostly dropped from 
the FST index and left in the term blocks stored separately.
   
   > I wonder if we can avoid adding floor data outputs into NodeHash some way?
   
   I'm curious: are there any `floorData` outputs in `NodeHash` (shared 
suffixes) at all today in the BlockTree terms index?
   
   On the ["limit how much RAM FST Compiler is allowed to use to share 
suffixes" PR](https://github.com/apache/lucene/pull/12633) I also tested fully 
disabling `NodeHash` (no suffix sharing) when storing all `wikimediumall` index 
`body` terms in an FST, and found a much bigger increase (65%: 350.2 MB 
-> 577.4 MB), because in that case the full suffixes are stored in the FST.
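
As a rough illustration of why disabling suffix sharing hurts when whole terms are stored, the toy sketch below builds a plain trie and then counts how many distinct nodes remain once identical subtrees are deduplicated through a hash map. This is not how `NodeHash` or the FST compiler is implemented: the class `SuffixSharingSketch`, its signature scheme, and the four sample terms are all assumptions made for the example, and outputs are ignored entirely.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of suffix sharing; hypothetical names, no outputs, letters-only terms.
final class SuffixSharingSketch {

    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean isFinal;
    }

    // Plain trie: prefixes are shared, suffixes are not.
    static Node buildTrie(List<String> terms) {
        Node root = new Node();
        for (String term : terms) {
            Node cur = root;
            for (char c : term.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.isFinal = true;
        }
        return root;
    }

    static int countTrieNodes(Node node) {
        int count = 1;
        for (Node child : node.children.values()) count += countTrieNodes(child);
        return count;
    }

    // Gives each node an id keyed by (final flag, labeled child ids),
    // so structurally identical subtrees map to the same id.
    static int canonicalId(Node node, Map<String, Integer> registry) {
        StringBuilder sig = new StringBuilder(node.isFinal ? "F" : "N");
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            sig.append(e.getKey()).append(':')
               .append(canonicalId(e.getValue(), registry)).append(';');
        }
        String key = sig.toString();
        Integer id = registry.get(key);
        if (id == null) {
            id = registry.size();
            registry.put(key, id);
        }
        return id;
    }

    // Number of nodes left once identical subtrees are shared.
    static int countSharedNodes(Node root) {
        Map<String, Integer> registry = new HashMap<>();
        canonicalId(root, registry);
        return registry.size();
    }

    public static void main(String[] args) {
        Node root = buildTrie(Arrays.asList("walking", "talking", "walked", "talked"));
        System.out.println(countTrieNodes(root));   // 19 nodes without suffix sharing
        System.out.println(countSharedNodes(root)); // 9 nodes with suffix sharing
    }
}
```

When only short prefixes go into the index, as in the BlockTree case described earlier, there is far less repeated tail structure for this kind of deduplication to find, which is consistent with the small `.tip` increase reported.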
   
   Similarly, if we explore [experimental codecs that hold all terms in an 
FST](https://github.com/apache/lucene/pull/12688), now possible / reasonable 
since the FST is off-heap, sharing the suffixes will be important. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


