Tony-X commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1682810659
Copying over the relevant wiki section on tantivy's TermDictionary.

---

## tantivy TermDictionary

Tantivy's term dictionary has two components. The first is an FST that encodes `term -> term_ordinal (u64)`, where the term ordinal is the term's index in sorted order. The second is the `TermInfoStore`, which maps the ordinal to the postings metadata. Conceptually, `TermInfoStore` is just a big vector of postings metadata ordered by term ordinal. Internally, it applies additional compression but still offers random access.

### TermInfoStore encodings

```
                               │
       Metadata Section        │              Blocks
┌───┬───┬──────────────┬───┬───┼───────────┬──────────────────┬─────────┐
│   │   │              │   │   │           │                  │         │
│   │   │              │   │   │           │                  │         │
│ 1 │   │   ... ...    │n-1│ n │  block 1  │     ... ...      │ block n │
│   │   │              │   │   │           │                  │         │
└───┴───┴──────────────┴───┴───┼───────────┴──────────────────┴─────────┘
                               │
    n fixed-size records       │
                               │
```

Each metadata record contains all the information needed to decode the corresponding block.

```rust
use std::ops::Range;

struct TermInfoBlockMeta {
    offset: u64,
    ref_term_info: TermInfo,
    doc_freq_nbits: u8,
    postings_offset_nbits: u8,
    positions_offset_nbits: u8,
}

pub struct TermInfo {
    /// Number of documents in the segment containing the term
    pub doc_freq: u32,
    /// Byte range of the posting list within the postings (`.idx`) file.
    pub postings_range: Range<usize>,
    /// Byte range of the positions of this term in the positions (`.pos`) file.
    pub positions_range: Range<usize>,
}
```

* `offset`: the start offset of the data block within the block section
* `ref_term_info`: the reference `TermInfo` on top of which the deltas are applied
* `*_nbits`: the bit widths used to bit-unpack the doc freq and the postings/positions offsets in the data block

At search time, both the FST and the `TermInfoStore` are loaded into memory. To look up a term, the searcher first consults the FST, which returns the term ordinal if the term exists. Because the FST contains every term, the lookup can tell definitively when a term does not exist.
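The two-step lookup described so far (term to ordinal, then ordinal to postings metadata) can be sketched roughly as follows. This is only a conceptual model, not tantivy's implementation: `ToyTermDictionary` and its methods are hypothetical names, a sorted `Vec` with binary search stands in for the FST, and a plain `Vec<TermInfo>` stands in for the compressed `TermInfoStore`.

```rust
use std::ops::Range;

/// Postings metadata for one term (fields mirror the `TermInfo` quoted above).
#[derive(Debug, Clone, PartialEq)]
struct TermInfo {
    doc_freq: u32,
    postings_range: Range<usize>,
    positions_range: Range<usize>,
}

/// Hypothetical stand-in for the term dictionary.
/// Assumption: a sorted Vec replaces the FST, and an uncompressed Vec
/// replaces the `TermInfoStore`.
struct ToyTermDictionary {
    sorted_terms: Vec<String>, // position in this Vec == term ordinal
    term_infos: Vec<TermInfo>, // indexed by term ordinal
}

impl ToyTermDictionary {
    /// Builds the dictionary from (term, info) pairs given in any order.
    fn new(mut entries: Vec<(String, TermInfo)>) -> Self {
        entries.sort_by(|a, b| a.0.cmp(&b.0));
        let (sorted_terms, term_infos): (Vec<String>, Vec<TermInfo>) =
            entries.into_iter().unzip();
        ToyTermDictionary { sorted_terms, term_infos }
    }

    /// Step 1: term -> ordinal. `None` means the term does not exist,
    /// which is decidable because every term is present in the map.
    fn term_ord(&self, term: &str) -> Option<u64> {
        self.sorted_terms
            .binary_search_by(|probe| probe.as_str().cmp(term))
            .ok()
            .map(|idx| idx as u64)
    }

    /// Step 2: ordinal -> postings metadata via random access.
    fn term_info_from_ord(&self, ord: u64) -> Option<&TermInfo> {
        self.term_infos.get(ord as usize)
    }

    /// Full lookup: chains the two steps.
    fn get(&self, term: &str) -> Option<&TermInfo> {
        self.term_ord(term)
            .and_then(|ord| self.term_info_from_ord(ord))
    }
}
```

With entries for `"fst"` and `"lucene"`, `term_ord("fst")` yields ordinal 0 and `term_ord("lucene")` yields ordinal 1, while a term that was never indexed yields `None` without touching the metadata vector at all.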
The term ordinal determines both which metadata record to use and the term's relative offset within the corresponding data block: the block index is the ordinal divided by the (fixed) block size, and the relative offset is the ordinal modulo that block size. The metadata record is then decoded; it locates the data block and provides what is needed to decode the `TermInfo` (postings metadata).

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

For additional commands, e-mail: issues-h...@lucene.apache.org