Tony-X commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1701907189
I'd like to ask for some advice about the situation I am in. I want to preserve the nice properties of tantivy's termdict as I port it over to Lucene:

1. Definitive term lookup: no additional scan is required for a non-existent term after the FST lookup.
2. Random addressing of term information given an ordinal, again with no additional scan.

(2) is possible because, after reading the metadata block, the reader can determine the record size in the corresponding data block, so it knows the term's data starts at `data_block_offset + (term_ord % block_size) * record_size`. This is the benefit of having a fixed record size per term in a block.

However, a few things in the way the current Lucene90PostingsFormat encodes TermState make this challenging:

1. docid pulsing -- when a term has just a single document associated with it, we don't write any postings and instead reuse docStartFp in the term's data to record that singleton docid.
2. skipOffset -- only set when the term has >= 128 docs.
3. lastPosBlockOffset -- similarly, this optionally indicates whether there is a VInt-encoded remainder block of positions.

One way I can think of to solve these is to change the postings format to make each posting/position record self-descriptive, e.g. by adding a small header per postings list in the pos file to store the singleton docid and the skip_offset, as well as the number of packed blocks before the positions data starts. But this would require changing the postings reader/writer.

I wonder if there are smarter ways (compared to storing everything at full width) to achieve a fixed-size record, preferably without changing Postings.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
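For readers following along, the random-addressing arithmetic quoted in the comment can be sketched roughly as below. This is only an illustration of the formula `data_block_offset + (term_ord % block_size) * record_size`, not code from tantivy or Lucene. The class name `FixedRecordAddressing`, the method names, and the choice of bytes as the unit for offsets and record sizes are all assumptions made for the example.

```java
// Hypothetical helper illustrating fixed-size-record addressing within a block.
// None of these names come from Lucene or tantivy; units are assumed to be bytes.
final class FixedRecordAddressing {
    private FixedRecordAddressing() {}

    /** Index of the block that holds the term with the given ordinal. */
    static long blockIndex(long termOrd, int blockSize) {
        if (termOrd < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("termOrd must be >= 0 and blockSize > 0");
        }
        return termOrd / blockSize;
    }

    /**
     * Offset of the term's record, given the start of its data block and the
     * per-block record size (both assumed to come from the block's metadata).
     */
    static long recordOffset(long dataBlockOffset, long termOrd, int blockSize, int recordSize) {
        if (termOrd < 0 || blockSize <= 0 || recordSize < 0) {
            throw new IllegalArgumentException("invalid addressing parameters");
        }
        long slotInBlock = termOrd % blockSize;             // position of the term inside its block
        return dataBlockOffset + slotInBlock * recordSize;  // no scan needed: one multiply, one add
    }

    public static void main(String[] args) {
        // Example values chosen for illustration only.
        long termOrd = 261;
        int blockSize = 128;
        int recordSize = 12;
        long dataBlockOffset = 4096;
        System.out.println("block index   = " + blockIndex(termOrd, blockSize));
        System.out.println("record offset = "
            + recordOffset(dataBlockOffset, termOrd, blockSize, recordSize));
    }
}
```

With these example values the term sits in block 2 at slot 5, so the record offset is 4096 + 5 * 12 = 4156. The point of the sketch is that the lookup only works if every record in a block has the same width, which is why optional fields such as skipOffset and lastPosBlockOffset are the obstacle described above.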
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org