herley-shaori commented on PR #15824:
URL: https://github.com/apache/lucene/pull/15824#issuecomment-4064549985

   > Thanks @herley-shaori -- that Panama per-byte bounds check is frustrating 
-- maybe @uschindler has an idea how to "elide" it?
   > 
   > I'm curious why we didn't detect this slowdown in Lucene's nightly 
benchmarks? It runs the `PKLookup` tasks which is exactly this "look up by 
primary key `id`" field usage: 
https://benchmarks.mikemccandless.com/PKLookup.html -- maybe that benchmark is 
missing something realistic since your/OpenSearch's usage was clearly 
(heavily!) affected?
   
   The discrepancy comes down to segment count. The nightly benchmark and the 
regression scenario exercise fundamentally different workload profiles:
   
   **Nightly benchmark (PKLookup):**
     - Uses a well-merged index (standard TieredMergePolicy)
     - Likely has ~10-20 large segments after merging
     - Each seekExact call traverses a small number of segments
     - At low segment counts, the trie's compact on-disk format wins: fewer 
cache misses, smaller index footprint, and
     the checkValidStateRaw() overhead per mmap read is amortized over fewer 
total calls
   
   **Issue #15820 scenario (OpenSearch):**
     - 32 KNN indices × 6 shards, refresh_interval=1s, continuous ingestion
     - ~400 small segments per index (before merges catch up)
     - Every indexed document with explicit _id triggers seekExact on **every 
segment** to check for version conflicts
     - At 32 TPS × 20 docs/bulk × 400 segments = **~256,000** seekExact 
**calls/sec**
     - Each seekExact navigates the trie byte-by-byte: for a 36-byte UUID _id, 
that's up to 36 lookupChild() calls, each
     doing 2-4 RandomAccessInput.readLong() on mmap
     - Total mmap reads: ~256K × ~36 × ~3 = **~27M mmap reads/sec**, each 
triggering MemorySessionImpl.checkValidStateRaw()
   
   The cost model flips at high segment counts: total_overhead = segment_count 
× terms_per_doc × bytes_per_term ×
   reads_per_node × checkValidStateRaw_cost. When segment count is low 
(benchmark), the constant factors of trie's compact format dominate. When 
segment count is high (real-world ingestion), the per-read mmap overhead 
dominates.
   
   **Why heap loading is the right fix:**
     - The trie index is typically small (much smaller than the terms data it 
indexes), so heap overhead is minimal
     - This matches lucene90 FST behavior where the term index was always 
heap-resident
     - ByteBuffersDataInput backed by a single heap ByteBuffer takes the fast 
path: blocks[0].getLong(blockOffset) — a
     direct Java array read with no Panama bounds check
     - It preserves the trie's structural advantages (compact format, simple 
navigation) while eliminating the mmap
     overhead
   
   **Benchmark suggestion:**
   The current PKLookup benchmark may benefit from an additional variant that 
tests with a **high segment count** (e.g.,
   200-500 small segments, NoMergePolicy, UUID-like random keys). This would 
catch regressions in the "many small
   segments during active ingestion" workload, which is common in 
near-real-time search systems. The current
   PerThreadPKLookup already sorts segments largest-first for early 
termination, but with many small segments of
   similar size, this optimization doesn't help much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to