airborne12 opened a new pull request, #63633:
URL: https://github.com/apache/doris/pull/63633
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Introduces a new inverted index storage format **V4** powered by SPIMI
(Single-Pass In-Memory Indexing), replacing the CLucene `IndexWriter`
on the write path for analyzed (fulltext) string columns.
#### Why
CLucene's `IndexWriter` accumulates per-token `Posting` linked-list
nodes, a term hash table, and a char[] interning pool. On Doris
fulltext columns these structures dominate BE memory during writes
and lead to OOM kills on large segments. The encoding CLucene writes
to disk (byte-equivalent to Lucene 2.x) is not the problem; the cost
lies in the *in-memory* representation. SPIMI keeps a flat
`(term_id, doc_id, position)` record array plus a single intern
arena, then sorts the records and emits the same Lucene 2.x sibling
files (`.tis/.tii/.frq/.prx/.fnm/segments_N`) on `finish()`. The
on-disk format is unchanged; only the shape of the writer's working
memory changes.
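
To illustrate the memory shape described above, here is a minimal C++ sketch of a flat record buffer with interned terms and a single sort at `finish()`. It is not the PR's `SpimiPostingBuffer`: `FlatPostingBuffer`, `Record`, and `term_text` are hypothetical names, the `unordered_map` keys stand in for the intern arena, and terms are ordered byte-wise as a simplification.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <string_view>
#include <tuple>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch; none of these names come from the PR.
struct Record {
    uint32_t term_id;
    uint32_t doc_id;
    uint32_t position;
};

class FlatPostingBuffer {
public:
    // Append one occurrence as a fixed-size record; no per-token node is allocated.
    void add(std::string_view term, uint32_t doc_id, uint32_t position) {
        records_.push_back({intern(term), doc_id, position});
    }

    const std::string& term_text(uint32_t term_id) const { return *terms_[term_id]; }

    // Sort once by (term text, doc_id, position), the order in which a term
    // dictionary and its postings would be written out.
    std::vector<Record> finish() {
        std::sort(records_.begin(), records_.end(),
                  [this](const Record& a, const Record& b) {
                      if (a.term_id != b.term_id) {
                          return term_text(a.term_id) < term_text(b.term_id);
                      }
                      return std::tie(a.doc_id, a.position) <
                             std::tie(b.doc_id, b.position);
                  });
        return std::move(records_);
    }

private:
    // Store each distinct term once and hand out a dense integer id.
    uint32_t intern(std::string_view term) {
        auto [it, inserted] = ids_.try_emplace(std::string(term),
                                               static_cast<uint32_t>(terms_.size()));
        if (inserted) terms_.push_back(&it->first);  // map keys are address-stable
        return it->second;
    }

    std::unordered_map<std::string, uint32_t> ids_;
    std::vector<const std::string*> terms_;
    std::vector<Record> records_;
};
```

Each occurrence costs one 12-byte record here regardless of how many distinct terms exist, which is the property the flat layout trades on.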
#### Measured impact (SPIMI_BENCH=1, ~614 K occurrences/segment)
| Dimension | V4 vs V2 (mostly_unique / all_unique) | V4 vs V2 (repetitive, vocab=16) |
|-----------|--------------------------------------|--------------------------------|
| Writer peak memory | **−55.6 %** / **−55.6 %** | +406 % (160 KB → 811 KB; both negligible) |
| Writer CPU (median) | **−68 %** / **−68 %** | +5 % (within bench cap) |
| `.idx` on-disk size | ~0 % | +8 % (PFOR sub-block header overhead) |
| Query latency | ~0 % | (not measured at bench scale) |
Repetitive vocabularies are the architectural trade-off region: V4's
compact-mode VInt-delta stream grows with the number of occurrences,
while CLucene's `Posting` struct grows with the number of unique
terms. Absolute memory in this regime is under 1 MB on both sides,
so the percentage swing has no production impact. The +8 %
storage-size delta on the repetitive workload is the documented PFOR
sub-block header cost.
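
The per-occurrence scaling can be seen in a small sketch of Lucene-style VInt delta encoding (7 payload bits per byte, high bit as continuation flag). This assumes the compact mode uses a stream of this general kind; the actual layout in the PR may differ, and `write_vint`, `read_vint`, `encode_deltas`, and `decode_deltas` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers; the PR's compact-mode layout may differ.

// Lucene-style VInt: low-order 7 bits first, high bit set on all but the last byte.
inline void write_vint(std::vector<uint8_t>& out, uint32_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

inline uint32_t read_vint(const std::vector<uint8_t>& in, std::size_t& pos) {
    uint32_t v = 0;
    int shift = 0;
    while (true) {
        const uint8_t b = in[pos++];
        v |= static_cast<uint32_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) return v;
        shift += 7;
    }
}

// Encode a non-decreasing id sequence as gaps; every entry costs at least one byte.
inline std::vector<uint8_t> encode_deltas(const std::vector<uint32_t>& sorted_ids) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t id : sorted_ids) {
        write_vint(out, id - prev);
        prev = id;
    }
    return out;
}

inline std::vector<uint32_t> decode_deltas(const std::vector<uint8_t>& bytes) {
    std::vector<uint32_t> ids;
    std::size_t pos = 0;
    uint32_t prev = 0;
    while (pos < bytes.size()) {
        prev += read_vint(bytes, pos);
        ids.push_back(prev);
    }
    return ids;
}
```

Because each occurrence appends at least one byte, a workload with few unique terms but many occurrences grows such a stream linearly, whereas a per-term struct stays roughly constant.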
#### What's in this PR
- **V4 writer pipeline** (`be/src/storage/index/inverted/spimi/`):
`SpimiPostingBuffer` (flat record + arena + intern map with
hybrid compact-mode VInt-delta migration), `SegmentWriter`,
`TermDictWriter`, `FieldInfosWriter`, `SegmentInfosWriter`,
PFOR encoder for high-doc-freq postings, `ByteOutput` family
abstracting CLucene's `IndexOutput`.
- **V4 reader pipeline**: `SpimiQueryIndexReader`,
`SpimiTermDocsReader`, `SpimiProxReader`, `SpimiTermEnum`,
`SpimiSearcherBuilder`; `SpimiFulltextIndexReader` is the
Doris-side adapter (overrides `type() -> SPIMI_FULLTEXT` so
the searcher cache routes correctly).
- **`column_reader.cpp` dispatch**: V4 storage format → SPIMI
reader; V1/V2/V3 unchanged.
- **`EmitSegment` post-flush self-validation**:
`ValidateClosedSegmentByteCounts` re-queries on-disk file
lengths after close, throws `INVERTED_INDEX_FILE_CORRUPTED` on
mismatch — guards against the async-S3 partial-flush class of
bugs that single-node tests can't see.
- **108 BE unit tests** under `be/test/storage/index/inverted/spimi/`
  plus extended tests under `be/test/storage/segment/`:
  - 17 corruption-path tests covering every `SPIMI_THROW_CORRUPT`
    site (segments_N / .frq / .prx / PFOR / .tis-.tii readers)
  - 7 byte-count validator tests including the truncation
    fault-injection case
  - Storage-size benchmark (V2 vs V4 `.idx` byte parity)
  - Throughput benchmark with 11 runs + 2 warmup discards +
    randomized V2/V4 alternation + full distribution report
  - Memory benchmark across mostly_unique / all_unique /
    repetitive workloads
  - Query-latency benchmark via the production read path
    (`InvertedIndexReaderTest.SpimiV2V4QueryLatencyBenchmark`)
    using the corrected `SpimiFulltextIndexReader::create_shared`
    dispatch
- **`SPIMI_BENCH` env-var tier**: default UT runs use 12 K
occurrences (fast regression guard); `SPIMI_BENCH=1` scales to
~614 K, `SPIMI_BENCH=large` scales to ~6 M for full-segment
stress. Keeps headline benchmark numbers reproducible without
ballooning every UT pass.
- **Regression suites**:
  - `inverted_index_p0/storage_format/test_storage_format_v4`:
    V2 vs V4 black-box parity across MATCH_ANY / MATCH_ALL /
    MATCH_PHRASE / MATCH_PHRASE_PREFIX / MATCH_REGEXP, NULL/empty
    handling, and the `support_phrase=false` (omit_tfap) no-prox
    write+read path.
  - `test_storage_format_v4_cloud`: same coverage gated by
    `isCloudMode()` so the async-S3 upload path gets exercised.
  - `test_storage_format_v4_query_latency`: cluster-level
    V2 vs V4 query timing distribution.
- **FE plumbing** (`PropertyAnalyzer`, `TabletIndex`,
`OlapTable`): accept `inverted_index_storage_format=V4` in
CREATE TABLE PROPERTIES; propagate through the protocol to BE.
#### What's NOT in this PR (known gaps)
- V4 segment compaction across multiple SPIMI segments — V4
currently emits a single `_0` segment per column; compaction is
documented as a follow-up in `SPIMI_DESIGN.md`.
- BM25-style scoring on V4 — V4 sets `omit_norms=true`; the read
side synthesizes a default-norm array. Score-using paths
(`MATCH_ALL` with relevance ordering) fall back to V2 behavior
on V4 columns. Listed in design doc.
- V4 only covers analyzed (fulltext) string columns. Keyword-mode
(`should_analyzer=false`) and numeric (BKD) paths remain on the
existing writers.
### Release note
Add inverted index storage format V4, an in-house SPIMI-based writer
that reduces BE write-side memory by ~55 % and CPU by ~68 % on
diverse-vocab fulltext workloads while keeping the on-disk segment
format compatible with Lucene 2.x. Enable it by setting
`inverted_index_storage_format = "V4"` in CREATE TABLE PROPERTIES.
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
    - [x] Regression test
    - [x] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason
- Behavior changed:
    - [ ] No.
    - [x] Yes. New value 'V4' accepted by inverted_index_storage_format
      property; V1/V2/V3 paths unchanged.
- Does this need documentation?
    - [ ] No.
    - [x] Yes. Doc PR will follow against apache/doris-website.
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label