[GitHub] [lucene] Tony-X commented on issue #12513: Try out a tantivy's term dictionary format

via GitHub Thu, 31 Aug 2023 16:29:20 -0700


Tony-X commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1701907189


   I'd like to ask for some advice on the situation I am in:
   
   I want to preserve two nice properties of tantivy's term dictionary as I 
port it over to Lucene:
   1. Definitive term lookup: no additional scan is required to reject a 
non-existent term after the FST lookup.
   2. Random access to term information given an ordinal, again with no 
additional scan.
   
   (2) is possible because, after reading the metadata block, the reader knows 
the record size in the corresponding data block, so it knows that the term's 
data starts at `data_block_offset + (term_ord % block_size) * record_size`. 
This is the benefit of having a fixed record size per term within a block. 
However, a few things in the way the current Lucene90PostingsFormat encodes its 
TermState make this challenging:
   1. docid pulsing -- when a term has only a single document associated with 
it, we don't write any postings and instead reuse the docStartFp in the term's 
data to record that singleton docid.
   2. skipOffset -- only set when the term has >= 128 docs.
   3. lastPosBlockOffset -- similarly optional; it indicates whether there is a 
VInt-encoded remainder block of positions.
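
   The fixed-size addressing in (2) can be sketched as below. This is only an 
illustration under stated assumptions: records are taken to be byte-aligned, 
and the class and method names are hypothetical, not part of Lucene or tantivy.

```java
// Hypothetical helper for illustration; not an existing Lucene or tantivy API.
final class FixedRecordAddressing {
    private FixedRecordAddressing() {}

    /** Index of the block that holds the term with the given ordinal. */
    static long blockIndex(long termOrd, int blockSize) {
        if (termOrd < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("termOrd must be >= 0 and blockSize > 0");
        }
        return termOrd / blockSize;
    }

    /**
     * Start of the term's record inside its data block, assuming every record
     * in that block has the same size (recordSize, here in bytes).
     */
    static long recordOffset(long dataBlockOffset, long termOrd, int blockSize, int recordSize) {
        if (termOrd < 0 || blockSize <= 0 || recordSize <= 0) {
            throw new IllegalArgumentException("invalid addressing parameters");
        }
        long slot = termOrd % blockSize; // position of the term within its block
        return dataBlockOffset + slot * recordSize;
    }
}
```

   The optional fields listed above are what break this: as soon as one term in 
a block omits skipOffset or lastPosBlockOffset, the records no longer share a 
single record size and the multiplication is no longer valid.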
   
   One way I can think of to solve these is to change the postings format to 
make each postings/positions record self-descriptive, e.g. by adding a small 
header per postings list in the pos file that stores the singleton docid and 
the skip_offset, as well as the number of packed blocks before the positions 
data starts. But this would require changing the postings reader/writer.
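
   A rough sketch of what such a self-descriptive header could look like. 
Everything here is an assumption made for illustration: the class name, the 
field order, the fixed-width encoding, and the use of -1 as an "absent" marker 
are all hypothetical. A real format would more likely use variable-length 
integers and Lucene's own DataOutput/DataInput rather than ByteBuffer.

```java
import java.nio.ByteBuffer;

// Hypothetical per-postings-list header; not an existing Lucene structure.
final class PostingsHeaderSketch {
    /** Encoded size in bytes: int + long + int. */
    static final int BYTES = Integer.BYTES + Long.BYTES + Integer.BYTES;

    final int singletonDocId;   // -1 when the term has more than one document
    final long skipOffset;      // -1 when the term has no skip data
    final int packedPosBlocks;  // packed blocks before the VInt remainder starts

    PostingsHeaderSketch(int singletonDocId, long skipOffset, int packedPosBlocks) {
        this.singletonDocId = singletonDocId;
        this.skipOffset = skipOffset;
        this.packedPosBlocks = packedPosBlocks;
    }

    /** Appends the header at the buffer's current position. */
    void writeTo(ByteBuffer out) {
        out.putInt(singletonDocId);
        out.putLong(skipOffset);
        out.putInt(packedPosBlocks);
    }

    /** Reads a header from the buffer's current position. */
    static PostingsHeaderSketch readFrom(ByteBuffer in) {
        int docId = in.getInt();
        long skip = in.getLong();
        int blocks = in.getInt();
        return new PostingsHeaderSketch(docId, skip, blocks);
    }
}
```

   With the optional values moved into a header like this, the term dictionary 
record would only need to carry the always-present file pointers, which is what 
would let it stay fixed-size.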
   
   I wonder if there are smarter ways (compared to storing everything at full 
width) to achieve fixed-size records, preferably without changing the postings 
format.
   



