Tony-X commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1724547262
I've been working out how to account for the optional state that each term may end up with. Namely, how to deal with the following:

* whether a term has a singleton docid
* whether a term has skip data
* whether a term has a last vInt-encoded position block (only relevant when the field has positions enabled)

I came up with a divide-and-conquer (divide et impera) approach. The idea is to classify each term into one of 8 categories (2^3, as there are 3 binary dimensions).

At indexing time, for a given field, the term information within each category shares the same structure, so we can apply the RefBlock + bit-packing encoding scheme per category. We will still use an FST to encode a term's ordinal. However, instead of storing the global ordinal we will store (category, ord), where ord is the ordinal within the category. This can fit into a long, with 3 bits for the category and the rest for the ordinal.

At search time, to look up a term, we consult the FST to get back the category and the local ordinal. We then locate the data file for that category and extract the term information using the local ordinal.

Of course, I still need to handle multiple fields, etc., and there are details such as how to organize the files. Those aside, I believe this scheme can work. In particular, it has a few nice properties:

* The FST can be definitive about whether a term exists or not.
* After the FST lookup, randomly accessing the term information takes only two seeks.

On the other hand, it might not compress as well as it potentially could, especially for monotonically increasing values such as the postings start offset. That's because nearby terms (by their global ordinal) may be spread across different categories, so some locality is lost.
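To make the (category, ord) encoding concrete, here is a minimal sketch of how the three flags could be folded into a category and packed with the local ordinal into a single long. The class name `TermSlot`, its method names, the flag-to-bit assignment, and the choice to put the category in the low 3 bits are all assumptions for illustration; the comment only says 3 bits go to the category and the rest to the ordinal.

```java
/** Hypothetical helper; names and bit layout are illustrative assumptions, not Lucene API. */
final class TermSlot {
  static final int CATEGORY_BITS = 3; // 2^3 = 8 categories
  static final long CATEGORY_MASK = (1L << CATEGORY_BITS) - 1;
  // Assumption: cap the ordinal at 60 bits so the packed value stays non-negative.
  static final long MAX_ORD = (1L << (63 - CATEGORY_BITS)) - 1;

  /** Maps the three optional term states to a category in [0, 7]. */
  static int categoryOf(boolean singletonDocId, boolean hasSkipData, boolean hasLastVIntPosBlock) {
    return (singletonDocId ? 1 : 0) | (hasSkipData ? 2 : 0) | (hasLastVIntPosBlock ? 4 : 0);
  }

  /** Packs a category and a category-local ordinal into one long. */
  static long pack(int category, long ord) {
    if (category < 0 || category > CATEGORY_MASK) {
      throw new IllegalArgumentException("category out of range: " + category);
    }
    if (ord < 0 || ord > MAX_ORD) {
      throw new IllegalArgumentException("ord out of range: " + ord);
    }
    return (ord << CATEGORY_BITS) | category;
  }

  /** Extracts the category from a packed value. */
  static int category(long packed) {
    return (int) (packed & CATEGORY_MASK);
  }

  /** Extracts the category-local ordinal from a packed value. */
  static long ord(long packed) {
    return packed >>> CATEGORY_BITS;
  }

  public static void main(String[] args) {
    // A term with a singleton docid, no skip data, and a last vInt position block.
    int cat = categoryOf(true, false, true);
    long packed = pack(cat, 42L);
    // prints: 5 341 5 42
    System.out.println(cat + " " + packed + " " + category(packed) + " " + ord(packed));
  }
}
```

At search time the packed long would be what comes back from the FST lookup: `category(packed)` selects the per-category data file and `ord(packed)` is the index into it.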
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org