[PR] [feature](be)(fe) Introduce SPIMI V4 inverted index storage format [doris]

via GitHub Mon, 25 May 2026 07:25:46 -0700


airborne12 opened a new pull request, #63633:
URL: https://github.com/apache/doris/pull/63633


   ### What problem does this PR solve?
   
   Issue Number: close #xxx
   
   Related PR: #xxx
   
   Problem Summary:
   
   Introduces a new inverted index storage format **V4** powered by SPIMI
   (Single-Pass In-Memory Indexing), replacing the CLucene `IndexWriter`
   on the write path for analyzed (fulltext) string columns.
   
   #### Why
   
   CLucene's `IndexWriter` accumulates per-token `Posting` linked-list
   nodes, a term hash table, and a char[] interning pool. On Doris
   fulltext columns this dominates BE memory during writes and shows up
   as OOM kills on large segments. The cost lies in the *in-memory*
   representation, not in the encoding, which is byte-equivalent to
   Lucene 2.x. SPIMI keeps a flat `(term_id, doc_id, position)` record
   array plus a single intern arena, then sorts the records and emits
   the same Lucene 2.x sibling files
   (`.tis/.tii/.frq/.prx/.fnm/segments_N`) on `finish()`. The on-disk
   format is unchanged; only the shape of the writer's working memory
   changes.
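
The buffer-then-sort idea described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `ToyPostingBuffer`, `Occurrence`, and `TermPostings` are hypothetical names, a `std::map` stands in for the intern arena, and the output is an in-memory list rather than Lucene 2.x files. It is not the `SpimiPostingBuffer` from this PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <string_view>
#include <tuple>
#include <vector>

// Hypothetical types for illustration; not the classes added by the PR.
struct Occurrence {
    uint32_t term_id;
    uint32_t doc_id;
    uint32_t position;
};

struct TermPostings {
    std::string term;
    std::vector<uint32_t> doc_ids; // ascending
    std::vector<uint32_t> freqs;   // occurrences per doc, parallel to doc_ids
};

class ToyPostingBuffer {
public:
    // Record one token occurrence. The term text is stored once (in the map);
    // every further occurrence costs a fixed-size record.
    void add(std::string_view term, uint32_t doc_id, uint32_t position) {
        auto it = _ids.find(term);
        if (it == _ids.end()) {
            it = _ids.emplace(std::string(term), static_cast<uint32_t>(_ids.size())).first;
        }
        _records.push_back({it->second, doc_id, position});
    }

    // Sort the flat records by (term text, doc_id, position) and group them.
    std::vector<TermPostings> finish() {
        // Map iteration is in lexicographic order, which gives each term its rank.
        std::vector<uint32_t> rank(_ids.size());
        std::vector<const std::string*> by_rank(_ids.size());
        uint32_t next_rank = 0;
        for (const auto& [text, id] : _ids) {
            rank[id] = next_rank;
            by_rank[next_rank] = &text;
            ++next_rank;
        }
        std::sort(_records.begin(), _records.end(),
                  [&](const Occurrence& a, const Occurrence& b) {
                      return std::make_tuple(rank[a.term_id], a.doc_id, a.position) <
                             std::make_tuple(rank[b.term_id], b.doc_id, b.position);
                  });
        std::vector<TermPostings> out;
        for (const Occurrence& rec : _records) {
            const std::string& text = *by_rank[rank[rec.term_id]];
            if (out.empty() || out.back().term != text) {
                out.push_back(TermPostings{text, {}, {}});
            }
            TermPostings& tp = out.back();
            if (tp.doc_ids.empty() || tp.doc_ids.back() != rec.doc_id) {
                tp.doc_ids.push_back(rec.doc_id);
                tp.freqs.push_back(0);
            }
            ++tp.freqs.back();
        }
        return out;
    }

private:
    std::map<std::string, uint32_t, std::less<>> _ids; // term text -> term_id
    std::vector<Occurrence> _records;                  // flat, append-only
};
```

In this sketch the per-occurrence cost is one 12-byte record, and nothing is allocated per posting node, which is the property the paragraph above attributes to the SPIMI writer.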
   
   #### Measured impact (SPIMI_BENCH=1, ~614 K occurrences/segment)
   
   | Dimension | V4 vs V2 (mostly_unique / all_unique) | V4 vs V2 (repetitive, vocab=16) |
   |-----------|--------------------------------------|--------------------------------|
   | Writer peak memory  | **−55.6 %** / **−55.6 %** | +406 % (160 KB → 811 KB; both negligible) |
   | Writer CPU (median) | **−68 %** / **−68 %** | +5 % (within bench cap) |
   | `.idx` on-disk size | ~0 % | +8 % (PFOR sub-block header overhead) |
   | Query latency       | ~0 % | (not measured at bench scale) |
   
   Repetitive vocabulary is the architectural trade-off region: V4's
   compact-mode VInt-delta stream scales per occurrence, while
   CLucene's `Posting` struct scales per unique term. Absolute memory
   in this regime is sub-MB on both sides, so the percentage swing has
   no production impact. The storage-size delta on the repetitive
   workload is the documented PFOR header cost.
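
To make the "scales per occurrence" point concrete, here is a minimal sketch of VInt-delta coding for an ascending doc-id list, using the Lucene-style VInt layout (7 payload bits per byte, low-order group first, high bit set when more bytes follow). The function names are hypothetical, and the actual compact-mode layout in V4 may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append one value as a VInt: 7 bits per byte, continuation flag in the high bit.
void append_vint(std::vector<uint8_t>& out, uint32_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// Encode ascending doc ids as VInt gaps from the previous id (the first gap is from 0).
std::vector<uint8_t> encode_doc_deltas(const std::vector<uint32_t>& sorted_doc_ids) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t doc : sorted_doc_ids) {
        append_vint(out, doc - prev);
        prev = doc;
    }
    return out;
}

// Inverse of encode_doc_deltas; assumes well-formed input.
std::vector<uint32_t> decode_doc_deltas(const std::vector<uint8_t>& bytes) {
    std::vector<uint32_t> docs;
    uint32_t prev = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        uint32_t delta = 0;
        int shift = 0;
        uint8_t b = 0;
        do {
            b = bytes[i++];
            delta |= static_cast<uint32_t>(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0 && i < bytes.size());
        prev += delta;
        docs.push_back(prev);
    }
    return docs;
}
```

Every entry costs at least one byte regardless of how small the vocabulary is, so a stream of this kind grows with the number of occurrences rather than with the number of unique terms.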
   
   #### What's in this PR
   
   - **V4 writer pipeline** (`be/src/storage/index/inverted/spimi/`):
     `SpimiPostingBuffer` (flat record + arena + intern map with
     hybrid compact-mode VInt-delta migration), `SegmentWriter`,
     `TermDictWriter`, `FieldInfosWriter`, `SegmentInfosWriter`,
     PFOR encoder for high-doc-freq postings, `ByteOutput` family
     abstracting CLucene's `IndexOutput`.
   - **V4 reader pipeline**: `SpimiQueryIndexReader`,
     `SpimiTermDocsReader`, `SpimiProxReader`, `SpimiTermEnum`,
     `SpimiSearcherBuilder`; `SpimiFulltextIndexReader` is the
     Doris-side adapter (overrides `type() -> SPIMI_FULLTEXT` so
     the searcher cache routes correctly).
   - **`column_reader.cpp` dispatch**: V4 storage format → SPIMI
     reader; V1/V2/V3 unchanged.
   - **`EmitSegment` post-flush self-validation**:
     `ValidateClosedSegmentByteCounts` re-queries on-disk file
     lengths after close, throws `INVERTED_INDEX_FILE_CORRUPTED` on
     mismatch — guards against the async-S3 partial-flush class of
     bugs that single-node tests can't see.
   - **108 BE unit tests** under `be/test/storage/index/inverted/spimi/`
     plus extended tests under `be/test/storage/segment/`:
     - 17 corruption-path tests covering every `SPIMI_THROW_CORRUPT`
       site (segments_N / .frq / .prx / PFOR / .tis-.tii readers)
     - 7 byte-count validator tests including the truncation
       fault-injection case
     - Storage-size benchmark (V2 vs V4 `.idx` byte parity)
     - Throughput benchmark with 11 runs + 2 warmup discards +
       randomized V2/V4 alternation + full distribution report
     - Memory benchmark across mostly_unique / all_unique /
       repetitive workloads
     - Query-latency benchmark via the production read path
       (`InvertedIndexReaderTest.SpimiV2V4QueryLatencyBenchmark`)
       using the corrected `SpimiFulltextIndexReader::create_shared`
       dispatch
   - **`SPIMI_BENCH` env-var tier**: default UT runs use 12 K
     occurrences (fast regression guard); `SPIMI_BENCH=1` scales to
     ~614 K, `SPIMI_BENCH=large` scales to ~6 M for full-segment
     stress. Keeps headline benchmark numbers reproducible without
     ballooning every UT pass.
   - **Regression suites**:
     - `inverted_index_p0/storage_format/test_storage_format_v4`
       — V2 vs V4 black-box parity across MATCH_ANY / MATCH_ALL /
       MATCH_PHRASE / MATCH_PHRASE_PREFIX / MATCH_REGEXP, NULL/empty
       handling, and the `support_phrase=false` (omit_tfap) no-prox
       write+read path.
     - `test_storage_format_v4_cloud` — same coverage gated by
       `isCloudMode()` so the async-S3 upload path gets exercised.
     - `test_storage_format_v4_query_latency` — cluster-level
       V2 vs V4 query timing distribution.
   - **FE plumbing** (`PropertyAnalyzer`, `TabletIndex`,
     `OlapTable`): accept `inverted_index_storage_format=V4` in
     CREATE TABLE PROPERTIES; propagate through the protocol to BE.
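
The post-flush byte-count check listed above can be illustrated with a small sketch: compare the byte counts the writer recorded against what the storage layer reports after close. Everything here is an assumption for illustration. `first_byte_count_mismatch` and `SizeLookup` are hypothetical names, the size query is injected as a callable so the logic stays independent of any file system API, and the sketch returns the offending file name where the PR's validator throws `INVERTED_INDEX_FILE_CORRUPTED`.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>

// Hypothetical callback: returns the current on-disk length of a file,
// or std::nullopt when the file cannot be found.
using SizeLookup = std::function<std::optional<uint64_t>(const std::string&)>;

// Returns the name of the first file whose on-disk length is missing or
// differs from the byte count the writer recorded; std::nullopt if all match.
std::optional<std::string> first_byte_count_mismatch(
        const std::map<std::string, uint64_t>& written, const SizeLookup& size_on_disk) {
    for (const auto& [name, expected] : written) {
        const std::optional<uint64_t> actual = size_on_disk(name);
        if (!actual.has_value() || *actual != expected) {
            return name;
        }
    }
    return std::nullopt;
}
```

A caller would build `written` while emitting each sibling file and pass a lookup that queries the real storage backend, so that a truncated or partially uploaded file is detected before the segment is reported as complete.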
   
   #### What's NOT in this PR (known gaps)
   
   - V4 segment compaction across multiple SPIMI segments — V4
     currently emits a single `_0` segment per column; compaction is
     documented as a follow-up in `SPIMI_DESIGN.md`.
   - BM25-style scoring on V4 — V4 sets `omit_norms=true`; the read
     side synthesizes a default-norm array. Score-using paths
     (`MATCH_ALL` with relevance ordering) fall back to V2 behavior
     on V4 columns. This gap is listed in the design doc.
   - V4 only covers analyzed (fulltext) string columns. Keyword-mode
     (`should_analyzer=false`) and numeric (BKD) paths remain on the
     existing writers.
   
   ### Release note
   
   Add inverted index storage format V4, an in-house SPIMI-based writer
   that reduces BE write-side memory by ~55 % and CPU by ~68 % on
   diverse-vocab fulltext workloads while keeping segment on-disk
   format Lucene 2.x compatible. Enable by setting
   `inverted_index_storage_format = "V4"` in CREATE TABLE PROPERTIES.
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [x] Regression test
       - [x] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason
   
   - Behavior changed:
       - [ ] No.
       - [x] Yes. New value 'V4' accepted by inverted_index_storage_format property; V1/V2/V3 paths unchanged.
   
   - Does this need documentation?
       - [ ] No.
       - [x] Yes. Doc PR will follow against apache/doris-website.
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label
   


