[GitHub] [lucene] Tony-X commented on issue #12513: Try out a tantivy's term dictionary format

via GitHub Thu, 17 Aug 2023 12:03:15 -0700


Tony-X commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1682810659

Copy-pasting the relevant wiki section about tantivy's TermDictionary:

---

## tantivy TermDictionary

Tantivy's term dictionary has two components. The first is an FST that encodes
`term -> term_ordinal (u64)`, where the term ordinal is the index of the term
in sorted order. The second is the `TermInfoStore`, which maps the ordinal to
the postings metadata. Conceptually, `TermInfoStore` is just a big vector of
postings metadata ordered by term ordinal. Internally, it applies additional
compression but still offers random access.
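
To make the two-step lookup concrete, here is a minimal sketch of the
conceptual model only. `ToyTermDictionary` is a hypothetical name, a sorted
`Vec<String>` with binary search stands in for the FST, and a plain
`Vec<TermInfo>` stands in for the compressed `TermInfoStore`. None of this is
tantivy's actual code, and it assumes the terms are unique.

```rust
use std::ops::Range;

/// Postings metadata for one term (same fields as the wiki's `TermInfo`).
#[derive(Debug, Clone, PartialEq)]
pub struct TermInfo {
    pub doc_freq: u32,
    pub postings_range: Range<usize>,
    pub positions_range: Range<usize>,
}

/// Hypothetical toy model of the term dictionary, not tantivy's API.
pub struct ToyTermDictionary {
    /// Sorted terms; the index of a term is its ordinal (stand-in for the FST).
    sorted_terms: Vec<String>,
    /// `term_infos[ord]` is the metadata of the term with ordinal `ord`
    /// (stand-in for an uncompressed `TermInfoStore`).
    term_infos: Vec<TermInfo>,
}

impl ToyTermDictionary {
    /// Builds the dictionary from (term, metadata) pairs; terms must be unique.
    pub fn new(mut entries: Vec<(String, TermInfo)>) -> Self {
        entries.sort_by(|a, b| a.0.cmp(&b.0));
        let (sorted_terms, term_infos): (Vec<String>, Vec<TermInfo>) =
            entries.into_iter().unzip();
        ToyTermDictionary { sorted_terms, term_infos }
    }

    /// Step 1: term -> ordinal. Returns `None` if the term does not exist.
    pub fn term_ord(&self, term: &str) -> Option<u64> {
        self.sorted_terms
            .binary_search_by(|t| t.as_str().cmp(term))
            .ok()
            .map(|i| i as u64)
    }

    /// Step 2: ordinal -> postings metadata, by plain random access.
    pub fn get(&self, term: &str) -> Option<&TermInfo> {
        let ord = self.term_ord(term)?;
        self.term_infos.get(ord as usize)
    }
}
```

The real structure differs in both halves: the FST shares prefixes and
suffixes between terms, and the `TermInfoStore` is block-compressed as
described in the next section.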

### TermInfoStore encodings

```

                               │
       Metadata Section        │ Blocks
┌───┬───┬──────────────┬───┬───┼───────────┬──────────────────┬─────────┐
│   │   │              │   │   │           │                  │         │
│   │   │              │   │   │           │                  │         │
│ 1 │   │   ... ...    │n-1│ n │  block 1  │     ... ...      │ block n │
│   │   │              │   │   │           │                  │         │
└───┴───┴──────────────┴───┴───┼───────────┴──────────────────┴─────────┘
                               │
      n fixed-size records     │
                               │
```

Each metadata record contains all the information needed to decode the
corresponding block.

```rust
use std::ops::Range;

struct TermInfoBlockMeta {
    offset: u64,
    ref_term_info: TermInfo,
    doc_freq_nbits: u8,
    postings_offset_nbits: u8,
    positions_offset_nbits: u8,
}

pub struct TermInfo {
    /// Number of documents in the segment containing the term
    pub doc_freq: u32,
    /// Byte range of the posting list within the postings (`.idx`) file.
    pub postings_range: Range<usize>,
    /// Byte range of the positions of this term in the positions (`.pos`)
    /// file.
    pub positions_range: Range<usize>,
}
```
* `offset`: the start offset of the data block in the block section
* `ref_term_info`: the reference `TermInfo` on top of which the deltas are
  applied
* `*_nbits`: the bit widths used to bit-unpack the doc freq and the
  postings/positions offsets in the data block.
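
As an illustration of what the `*_nbits` fields are for, here is a sketch of
fixed-width bit-unpacking plus applying a delta to a reference value. The
function names `unpack_bits` and `offset_from_delta` are hypothetical, and the
LSB-first bit order and the idea that an offset is stored as `reference +
delta` are assumptions made for the example; tantivy's real on-disk layout may
differ.

```rust
/// Reads the `idx`-th value from `data`, where every value occupies exactly
/// `nbits` bits and values are packed back to back, least significant bit
/// first. (Assumed layout, for illustration only.)
pub fn unpack_bits(data: &[u8], nbits: u8, idx: usize) -> u64 {
    assert!(nbits <= 64, "a value cannot be wider than 64 bits");
    let start = idx * nbits as usize;
    let mut value = 0u64;
    // Bit-by-bit for clarity; a real decoder would read whole words.
    for i in 0..nbits as usize {
        let bit_pos = start + i;
        let bit = (data[bit_pos / 8] >> (bit_pos % 8)) & 1;
        value |= (bit as u64) << i;
    }
    value
}

/// Rebuilds an absolute offset from the block's reference offset and the
/// bit-packed delta stored for entry `idx`.
pub fn offset_from_delta(reference: u64, data: &[u8], nbits: u8, idx: usize) -> u64 {
    reference + unpack_bits(data, nbits, idx)
}
```

Because every entry in a block has the same width, the `idx`-th entry starts
at bit `idx * nbits`, which is what keeps random access possible despite the
compression.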

At search time, both the FST and the `TermInfoStore` are loaded into memory. To
search for a term, tantivy first consults the FST to get the term ordinal, if
the term exists. Note that the FST contains all the terms, so the lookup can
tell when a term does not exist. The term ordinal then determines which
metadata record to read and the term's relative position within the data
block: dividing the ordinal by the (fixed) block size gives the record, and the
remainder gives the position. Finally, it decodes the metadata record, which
helps to locate the data block and decode the `TermInfo` (postings metadata).
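
The ordinal arithmetic in that lookup can be sketched as follows. `BLOCK_LEN`,
`locate`, and `meta_record_range` are names chosen for this example, the block
size of 256 is an assumption rather than something stated above, and the
metadata record length is left as a parameter because the wiki section does
not give it.

```rust
use std::ops::Range;

/// Assumed fixed number of terms per block (not stated in the wiki section).
pub const BLOCK_LEN: u64 = 256;

/// Splits a term ordinal into (index of the metadata record / block,
/// position of the term inside that block).
pub fn locate(term_ord: u64) -> (usize, usize) {
    let block_id = (term_ord / BLOCK_LEN) as usize;
    let inner = (term_ord % BLOCK_LEN) as usize;
    (block_id, inner)
}

/// Byte range of the `block_id`-th metadata record, given that all records
/// have the same length `record_len`.
pub fn meta_record_range(block_id: usize, record_len: usize) -> Range<usize> {
    let start = block_id * record_len;
    start..start + record_len
}
```

For example, with a block size of 256, ordinal 1000 falls in block 3 at
position 232. Since both the records and the per-term entries are fixed-width,
no scanning is needed at any step.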

