Tony-X commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1682810659

   Copy-pasting the relevant wiki section about tantivy's TermDictionary:
   
   ---
   
   ## tantivy TermDictionary
   
   Tantivy's term dictionary has two components. The first is an FST that
encodes `term -> term_ordinal (u64)`, where the term ordinal is the index of
the term in sorted order. The second is the `TermInfoStore`, which maps the
ordinal to the postings metadata. Conceptually, `TermInfoStore` is just a big
vector of postings metadata ordered by term ordinal. Internally, it applies
additional compression but still offers random access.
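
As a rough illustration of the two-component layout, the sketch below uses a sorted vector with binary search in place of the FST and a plain `Vec` in place of the compressed `TermInfoStore`. `ToyTermDictionary` and its methods are hypothetical names, not tantivy's API, and the sketch assumes all terms are distinct.

```rust
use std::ops::Range;

/// Postings metadata for one term (same fields as the `TermInfo` in the wiki).
#[derive(Debug, Clone, PartialEq)]
pub struct TermInfo {
    pub doc_freq: u32,
    pub postings_range: Range<usize>,
    pub positions_range: Range<usize>,
}

/// Hypothetical, uncompressed stand-in for the term dictionary.
pub struct ToyTermDictionary {
    /// Sorted terms; binary search plays the role of the FST here.
    sorted_terms: Vec<String>,
    /// Plays the role of `TermInfoStore`: entry `i` belongs to ordinal `i`.
    term_infos: Vec<TermInfo>,
}

impl ToyTermDictionary {
    /// Build the dictionary; the ordinal of a term is its rank in sorted order.
    pub fn new(mut entries: Vec<(String, TermInfo)>) -> Self {
        entries.sort_by(|a, b| a.0.cmp(&b.0));
        let (sorted_terms, term_infos): (Vec<String>, Vec<TermInfo>) =
            entries.into_iter().unzip();
        Self { sorted_terms, term_infos }
    }

    /// `term -> term_ordinal`, or `None` if the term is not in the dictionary.
    pub fn term_ord(&self, term: &str) -> Option<u64> {
        self.sorted_terms
            .binary_search_by(|probe| probe.as_str().cmp(term))
            .ok()
            .map(|i| i as u64)
    }

    /// `term_ordinal -> TermInfo` by random access into the vector.
    pub fn term_info(&self, term_ord: u64) -> Option<&TermInfo> {
        self.term_infos.get(term_ord as usize)
    }
}
```

A lookup chains the two steps: `dict.term_ord("lucene").and_then(|ord| dict.term_info(ord))`.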
   
   ### TermInfoStore encodings
   
   ```
   
   
                                    │
           Metadata Section         │                  Blocks
     ┌───┬───┬──────────────┬───┬───┼───────────┬──────────────────┬─────────┐
     │   │   │              │   │   │           │                  │         │
     │   │   │              │   │   │           │                  │         │
     │ 1 │   │   ... ...    │n-1│ n │  block 1  │     ... ...      │ block n │
     │   │   │              │   │   │           │                  │         │
     └───┴───┴──────────────┴───┴───┼───────────┴──────────────────┴─────────┘
                                    │
           n fixed-size records     │
                                    │
   ```
   
   
   The metadata record contains all information needed to decode the 
corresponding block. 
   
   ```rust
   struct TermInfoBlockMeta {
       offset: u64,
       ref_term_info: TermInfo,
       doc_freq_nbits: u8,
       postings_offset_nbits: u8,
       positions_offset_nbits: u8,
   }
   
   pub struct TermInfo {
       /// Number of documents in the segment containing the term
       pub doc_freq: u32,
       /// Byte range of the posting list within the postings (`.idx`) file.
       pub postings_range: Range<usize>,
       /// Byte range of the positions of this term in the positions (`.pos`) file.
       pub positions_range: Range<usize>,
   }
   ```
   * `offset`: the start offset of the data block within the block section
   * `ref_term_info`: the reference `TermInfo` on top of which the deltas are
applied
   * `*_nbits`: the bit widths used to bit-unpack the doc freq and the
postings/positions offsets in the data block
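
To make the role of the reference values and the `*_nbits` fields concrete, here is a minimal sketch of decoding one entry from a bit-packed block. Everything about the layout is an assumption made for illustration, not tantivy's actual on-disk format: bits are packed least significant bit first, each entry stores `doc_freq` as-is followed by the two start offsets as deltas from the reference values, and `BlockMeta`, `DecodedTermInfo`, `read_bits` and `write_bits` are hypothetical names.

```rust
/// Hypothetical per-block metadata: reference start offsets plus bit widths.
pub struct BlockMeta {
    pub ref_postings_start: u64,
    pub ref_positions_start: u64,
    pub doc_freq_nbits: u8,
    pub postings_offset_nbits: u8,
    pub positions_offset_nbits: u8,
}

#[derive(Debug, PartialEq)]
pub struct DecodedTermInfo {
    pub doc_freq: u32,
    pub postings_start: u64,
    pub positions_start: u64,
}

/// Read `nbits` (at most 64) starting at `bit_offset`, least significant bit first.
pub fn read_bits(data: &[u8], bit_offset: usize, nbits: u8) -> u64 {
    let mut value = 0u64;
    for i in 0..nbits as usize {
        let pos = bit_offset + i;
        let bit = (data[pos / 8] >> (pos % 8)) & 1;
        value |= (bit as u64) << i;
    }
    value
}

/// Append the low `nbits` of `value` to `out`, least significant bit first.
pub fn write_bits(out: &mut Vec<u8>, bit_len: &mut usize, value: u64, nbits: u8) {
    for i in 0..nbits as usize {
        if *bit_len % 8 == 0 {
            out.push(0); // start a new byte
        }
        let bit = ((value >> i) & 1) as u8;
        let last = out.len() - 1;
        out[last] |= bit << (*bit_len % 8);
        *bit_len += 1;
    }
}

impl BlockMeta {
    /// Number of bits occupied by one entry, so entries allow random access.
    pub fn entry_bits(&self) -> usize {
        self.doc_freq_nbits as usize
            + self.postings_offset_nbits as usize
            + self.positions_offset_nbits as usize
    }

    /// Decode the entry at `inner_offset` within `block`.
    pub fn decode(&self, block: &[u8], inner_offset: usize) -> DecodedTermInfo {
        let mut cursor = inner_offset * self.entry_bits();
        let doc_freq = read_bits(block, cursor, self.doc_freq_nbits) as u32;
        cursor += self.doc_freq_nbits as usize;
        let postings_delta = read_bits(block, cursor, self.postings_offset_nbits);
        cursor += self.postings_offset_nbits as usize;
        let positions_delta = read_bits(block, cursor, self.positions_offset_nbits);
        DecodedTermInfo {
            doc_freq,
            postings_start: self.ref_postings_start + postings_delta,
            positions_start: self.ref_positions_start + positions_delta,
        }
    }
}
```

Because every entry in a block has the same bit width, the position of entry `i` is simply `i * entry_bits()`, which is what keeps random access possible despite the compression.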
   
   At search time, both the FST and the `TermInfoStore` are loaded into memory.
To search for a term, tantivy first consults the FST to get the term ordinal,
if the term exists. Note that the FST contains all the terms, so the lookup can
tell whether a term does not exist. The term ordinal then determines which
metadata record to use and the term's relative offset within the data block:
dividing the ordinal by the (fixed) block size gives the former, and the
ordinal modulo the block size gives the latter. Finally, it decodes the
metadata record, which helps to locate the data block and decode the
`TermInfo` (postings metadata).
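
The ordinal arithmetic in this lookup can be sketched as follows. The constants are assumptions for illustration only: the wiki says the block size and the metadata record size are fixed but does not give their values, and `BlockAddress` and `locate` are hypothetical names.

```rust
/// Assumed number of terms per block (the wiki only says it is fixed).
pub const BLOCK_LEN: u64 = 256;

/// Assumed size in bytes of one fixed-size metadata record.
pub const META_RECORD_BYTES: u64 = 32;

/// Where to find a term's entry, derived purely from its ordinal.
#[derive(Debug, PartialEq)]
pub struct BlockAddress {
    /// Index of the metadata record (and of the data block it describes).
    pub block_id: u64,
    /// Position of the term's entry inside that data block.
    pub inner_offset: u64,
    /// Byte offset of the metadata record within the metadata section.
    pub meta_record_offset: u64,
}

/// Map a term ordinal to its block and its position inside the block.
pub fn locate(term_ord: u64) -> BlockAddress {
    let block_id = term_ord / BLOCK_LEN;
    BlockAddress {
        block_id,
        inner_offset: term_ord % BLOCK_LEN,
        meta_record_offset: block_id * META_RECORD_BYTES,
    }
}
```

Since the metadata records have a fixed size, finding the right record is a multiplication rather than a search, and the record's `offset` field then points at the data block itself.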
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

