herley-shaori commented on PR #15824: URL: https://github.com/apache/lucene/pull/15824#issuecomment-4064549985
> Thanks @herley-shaori -- that Panama per-byte bounds check is frustrating -- maybe @uschindler has an idea how to "elide" it? > > I'm curious why we didn't detect this slowdown in Lucene's nightly benchmarks? It runs the `PKLookup` tasks which is exactly this "look up by primary key `id`" field usage: https://benchmarks.mikemccandless.com/PKLookup.html -- maybe that benchmark is missing something realistic since your/OpenSearch's usage was clearly (heavily!) affected? The discrepancy comes down to segment count. The nightly benchmark and the regression scenario exercise fundamentally different workload profiles: **Nightly benchmark (PKLookup):** - Uses a well-merged index (standard TieredMergePolicy) - Likely has ~10-20 large segments after merging - Each seekExact call traverses a small number of segments - At low segment counts, the trie's compact on-disk format wins: fewer cache misses, smaller index footprint, and the checkValidStateRaw() overhead per mmap read is amortized over fewer total calls **Issue #15820 scenario (OpenSearch):** - 32 KNN indices × 6 shards, refresh_interval=1s, continuous ingestion - ~400 small segments per index (before merges catch up) - Every indexed document with explicit _id triggers seekExact on **every segment** to check for version conflicts - At 32 TPS × 20 docs/bulk × 400 segments = **~256,000** seekExact **calls/sec** - Each seekExact navigates the trie byte-by-byte: for a 36-byte UUID _id, that's up to 36 lookupChild() calls, each doing 2-4 RandomAccessInput.readLong() on mmap - Total mmap reads: ~256K × ~36 × ~3 = **~27M mmap reads/sec**, each triggering MemorySessionImpl.checkValidStateRaw() The cost model flips at high segment counts: total_overhead = segment_count × terms_per_doc × bytes_per_term × reads_per_node × checkValidStateRaw_cost. When segment count is low (benchmark), the constant factors of trie's compact format dominate. When segment count is high (real-world ingestion), the per-read mmap overhead dominates. **Why heap loading is the right fix:** - The trie index is typically small (much smaller than the terms data it indexes), so heap overhead is minimal - This matches lucene90 FST behavior where the term index was always heap-resident - ByteBuffersDataInput backed by a single heap ByteBuffer takes the fast path: blocks[0].getLong(blockOffset) — a direct Java array read with no Panama bounds check - It preserves the trie's structural advantages (compact format, simple navigation) while eliminating the mmap overhead **Benchmark suggestion:** The current PKLookup benchmark may benefit from an additional variant that tests with a **high segment count** (e.g., 200-500 small segments, NoMergePolicy, UUID-like random keys). This would catch regressions in the "many small segments during active ingestion" workload, which is common in near-real-time search systems. The current PerThreadPKLookup already sorts segments largest-first for early termination, but with many small segments of similar size, this optimization doesn't help much. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
