Tony-X commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1724547262

   I've been designing how to account for the optional states that each term 
may end up with. Namely, how to deal with the following: 
   * whether a term has a singleton docid
   * whether a term has skip data 
   * whether a term has a last vInt-encoded position block (only relevant when 
the field has positions enabled)
   
   I came up with a divide et impera (divide and conquer) approach. The idea is 
to classify each term into one of 8 categories (2^3, as there are 3 binary 
dimensions). At indexing time, for a given field, the term information within 
each category shares the same structure, so we can apply the RefBlock + 
bit-packing encoding scheme per category. We will still use an FST to encode 
each term's ordinal. However, instead of storing the global ordinal we will 
store (category, ord), where ord is the ordinal within the category. This fits 
into a long, with 3 bits for the category and the rest for the ordinal. 
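
   To make the (category, ord) packing concrete, here is a minimal sketch. The 
class name, the flag-to-bit assignment, and the choice to put the category in 
the low 3 bits are all assumptions for illustration; nothing here is settled 
or taken from existing Lucene code. Keeping the category in the low bits and 
capping ord at 2^60 - 1 keeps the packed value non-negative, which would matter 
if the FST output type only accepts non-negative longs.

```java
/** Hypothetical helper; names and bit layout are illustrative only. */
final class TermCategoryCodec {
  static final int CATEGORY_BITS = 3;
  static final long CATEGORY_MASK = (1L << CATEGORY_BITS) - 1; // 0b111
  // Largest ordinal that still leaves the packed value non-negative.
  static final long MAX_ORD = Long.MAX_VALUE >>> CATEGORY_BITS;

  private TermCategoryCodec() {}

  /** Maps the three optional states to a category in [0, 7], one bit per state. */
  static int category(boolean singletonDocId, boolean hasSkipData, boolean hasLastVIntPosBlock) {
    int c = 0;
    if (singletonDocId) c |= 1;
    if (hasSkipData) c |= 2;
    if (hasLastVIntPosBlock) c |= 4;
    return c;
  }

  /** Packs the category into the low 3 bits and the local ordinal into the rest. */
  static long pack(int category, long ord) {
    if (category < 0 || category > CATEGORY_MASK) {
      throw new IllegalArgumentException("category out of range: " + category);
    }
    if (ord < 0 || ord > MAX_ORD) {
      throw new IllegalArgumentException("ord out of range: " + ord);
    }
    return (ord << CATEGORY_BITS) | category;
  }

  static int unpackCategory(long packed) {
    return (int) (packed & CATEGORY_MASK);
  }

  static long unpackOrd(long packed) {
    return packed >>> CATEGORY_BITS;
  }
}
```

   With this layout, a term that has a singleton docid and a last vInt position 
block but no skip data falls in category 5, and packing it with local ordinal 42 
yields 341 (42 * 8 + 5).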
   
   At search time, to look up a term, we consult the FST to get back the 
category and local ordinal. Then locate the data file for that category and 
extract out the term information with the local ordinal.
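
   The lookup path could be sketched roughly as follows. This is a toy model 
under stated assumptions: a HashMap stands in for the FST, one long array per 
category stands in for the per-category data file, and the only "term 
information" stored is a single long such as a postings start offset. All names 
are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Toy model of the two-step lookup; not real Lucene code. */
final class CategorizedTermIndex {
  private static final int CATEGORY_BITS = 3;
  private static final int NUM_CATEGORIES = 1 << CATEGORY_BITS;

  // Stand-in for the FST: term -> packed (category, local ordinal).
  private final Map<String, Long> termToPacked = new HashMap<>();
  // Stand-in for the per-category data files, indexed by local ordinal.
  private final long[][] perCategoryData;
  // Next free local ordinal in each category.
  private final int[] counts = new int[NUM_CATEGORIES];

  CategorizedTermIndex(int capacityPerCategory) {
    perCategoryData = new long[NUM_CATEGORIES][capacityPerCategory];
  }

  /** Indexing time: assign the next local ordinal within the term's category. */
  void add(String term, int category, long termInfo) {
    int ord = counts[category]++;
    perCategoryData[category][ord] = termInfo;
    termToPacked.put(term, ((long) ord << CATEGORY_BITS) | category);
  }

  /** Search time: consult the "FST", then read from the category's data. */
  OptionalLong lookup(String term) {
    Long packed = termToPacked.get(term);
    if (packed == null) {
      return OptionalLong.empty(); // the term does not exist
    }
    int category = (int) (packed & (NUM_CATEGORIES - 1));
    int ord = (int) (packed >>> CATEGORY_BITS);
    return OptionalLong.of(perCategoryData[category][ord]);
  }
}
```

   In a real implementation the array read would be replaced by locating the 
block for the local ordinal in that category's file and decoding the bit-packed 
entry.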
   
   Of course, I need to handle multiple fields, etc., and there are details like 
how to organize the files. Besides all that, I believe this scheme can work 
out. In particular, it has a few nice properties:
   * The FST can be definitive about whether a term exists
   * After the FST lookup, randomly accessing the term's information takes only 
two seeks
   
   On the other hand, it might not compress as well as it potentially could, 
especially for monotonically increasing values such as the postings start 
offset. That's because nearby terms (by global ordinal) may be spread across 
different categories, losing some locality. 
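
   A toy illustration of that locality loss, assuming the encoder spends bits 
in proportion to the largest gap between consecutive values in a block (the 
actual RefBlock layout may differ, and the helper name is hypothetical):

```java
/** Toy measurement; assumes a delta-style bit-packed encoding. */
final class LocalityToy {
  private LocalityToy() {}

  /** Bits needed to bit-pack the largest gap between consecutive sorted values. */
  static int bitsForMaxDelta(long[] sortedValues) {
    long maxDelta = 0;
    for (int i = 1; i < sortedValues.length; i++) {
      maxDelta = Math.max(maxDelta, sortedValues[i] - sortedValues[i - 1]);
    }
    return 64 - Long.numberOfLeadingZeros(maxDelta);
  }
}
```

   For start offsets 0, 10, 20, 30, 40, 50 in global term order, the largest 
gap is 10, which needs 4 bits. If the terms alternate between two categories, 
each category sees 0, 20, 40 or 10, 30, 50, so the largest gap is 20 and needs 
5 bits per value.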


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

