[PR] Add ArrayTermInSetQuery to sandbox/ along with a JMH benchmark [lucene]

via GitHub Tue, 12 May 2026 02:50:59 -0700


GovindBalaji-S-Glean opened a new pull request, #16051:
URL: https://github.com/apache/lucene/pull/16051


   ### Description
   
   PR for Lucene dev@ thread: 
https://lists.apache.org/thread/ct04woc11hh9vclhscz8pkdozv6xoy6k
   
   Adds an alternative `TermInSetQuery` implementation in the sandbox module 
that stores terms as a sorted `BytesRef[]` wrapping a packed `byte[]`, rather 
than `PrefixCodedTerms`. Trades some RAM for cheaper per-segment iteration and 
a vectorized `equals`/`hashCode` fast path on the cache hot path. Includes a 
JMH benchmark in `benchmark-jmh`.
   
   #### Build / test
   
   - `./gradlew :lucene:sandbox:test --tests "*ArrayTermInSet*"` → 17 tests 
pass on JDK 25.
   - `./gradlew :lucene:benchmark-jmh:assemble` → green.
   - `./gradlew tidy` → no formatting changes outside the 4 files in this PR.
   - `./gradlew :lucene:sandbox:check :lucene:benchmark-jmh:check` → green on 
JDK 25 (Temurin 25.0.3).
   
   #### Benchmark
   
   Full matrix on a 32-core / 125 GB GCP n2 VM, JDK 25 (Temurin 25.0.3), JMH 
1.37, single fork, 3 warmup × 5 measurement × 5 s.
   
   ### Benchmark parameters
   
   | Param | Values | What it varies | Used by |
   |---|---|---|---|
   | `numTerms` | `30`, `300`, `3000`, `30000` | Number of terms in the query — 
from a tiny ACL filter to a large enumerated set. | all benchmarks |
   | `numSegments` | `5`, `20`, `50` | Number of segments in the index. Each 
segment pays the per-term seek cost, so this scales the iteration phase. | 
`construct+iterate*` only |
   | `indexContent` | `QUERY_ONLY` &nbsp;·&nbsp; `SPARSE` &nbsp;·&nbsp; 
`RANDOM_50K` | What lives in each segment's terms dictionary (see below). | 
`construct+iterate*` only |
   | `inputShape` | `UNSORTED_LIST` &nbsp;·&nbsp; `SORTED_SET` | How the query 
terms are passed to the constructor (see below). | `construct*` and 
`construct+iterate*` |
   
   #### `indexContent`
   
   - **`QUERY_ONLY`** — every query term is indexed in every segment. All seeks 
hit.
   - **`SPARSE`** — only 2 deterministic terms (first + middle of the query 
set) per segment, so most seeks miss. The shape that hurts `TermInSetQuery` the 
most because `PrefixCodedTerms` decodes per term regardless of hit/miss.
   - **`RANDOM_50K`** — 50 000 random terms per segment, independent of the 
query. Zero hits, but a large dictionary to navigate. Stresses the seek path on 
a deep terms tree.
   
   #### `inputShape`
   
   - **`UNSORTED_LIST`** — `Arrays.asList(shuffled)`. Both queries radix-sort 
internally, so this measures sort + storage cost.
   - **`SORTED_SET`** — `TreeSet<BytesRef>` (natural-order comparator). Both 
queries hit their skip-sort fast path, so this isolates storage-shape cost from 
sort cost.
   
   #### Benchmark methods
   
   | Method | Measures |
   |---|---|
   | `construct{TermInSet,ArrayTermInSet}Query` | Constructor cost only. |
   | `constructAndIterate{TermInSet,ArrayTermInSet}Query` | End-to-end query 
execution: ctor + per-segment terms iteration. |
   | `equals{TermInSetQuery, ArrayTermInSetQuery, FlatPacked, 
PackedPlusLengths}` | Cache-hit equality on equal queries — the `LRUQueryCache` 
hot path. |
   
   Numbers are µs/op average.
   
   ##### `construct` only — query ctor cost
   
   | numTerms | shape | Array µs | TIS µs | TIS / Array |
   |---:|---|---:|---:|---:|
   | 30 | SORTED_SET | 0.86 | 2.28 | **2.6×** |
   | 30 | UNSORTED_LIST | 2.87 | 4.06 | 1.4× |
   | 300 | SORTED_SET | 11.1 | 22.0 | **2.0×** |
   | 300 | UNSORTED_LIST | 32.8 | 43.5 | 1.3× |
   | 3 000 | SORTED_SET | 108 | 215 | **2.0×** |
   | 3 000 | UNSORTED_LIST | 361 | 477 | 1.3× |
   | 30 000 | SORTED_SET | 1 106 | 2 182 | **2.0×** |
   | 30 000 | UNSORTED_LIST | 5 400 | 5 952 | 1.1× |
   
   Skip-sort fast path nets a clean ~2× across all sizes; on `UNSORTED_LIST` 
the gap collapses at 30k terms because the radix sort dominates and both 
queries pay it.
   
   ##### `construct + per-segment iterate` — the path real queries take, slice 
at `numSegments=20`
   Full benchmark output: 
https://gist.github.com/GovindBalaji-S-Glean/fe4af91bead4dcd2390fa4da381b09e7
   
   | numTerms | indexContent | shape | Array µs | TIS µs | TIS / Array |
   |---:|---|---|---:|---:|---:|
   | 300 | QUERY_ONLY | SORTED_SET | 902 | 1 049 | 1.16× |
   | 300 | RANDOM_50K | SORTED_SET | 2 293 | 2 687 | 1.17× |
   | 300 | **SPARSE** | SORTED_SET | **108** | **193** | **1.79×** |
   | 3 000 | QUERY_ONLY | SORTED_SET | 10 071 | 11 229 | 1.11× |
   | 3 000 | **SPARSE** | SORTED_SET | **356** | **1 212** | **3.41×** |
   | 30 000 | QUERY_ONLY | SORTED_SET | 87 202 | 118 585 | 1.36× |
   | 30 000 | RANDOM_50K | SORTED_SET | 60 005 | 76 420 | 1.27× |
   | 30 000 | **SPARSE** | SORTED_SET | **2 819** | **11 593** | **4.11×** |
   
   Pattern:
   - **Dense** indexes (`QUERY_ONLY`, `RANDOM_50K`): Array wins consistently, 
**1.1–1.4×**. Iteration cost dominates; storage shape matters less.
   - **Sparse** indexes (`SPARSE` — 2 indexed terms vs a 30/300/3k/30k-term 
query): Array wins **1.5–4.5×** and the gap widens with `numTerms × 
numSegments`. This is exactly the case where TIS's `PrefixCodedTerms` decode 
runs once per query term per segment while Array just walks a sorted 
`BytesRef[]`.
   
   ##### `equals` (cache-hit equality on equal queries)
   
   | numTerms | Array µs | TIS µs | TIS / Array |
   |---:|---:|---:|---:|
   | 300 | 0.091 ± 0.001 | 0.235 ± 0.001  | **2.58×** |
   | 3 000 | 1.030 ± 0.005 | 1.354 ± 0.008  | 1.31× |
   | 30 000 | 13.92 ± 1.97 | 12.86 ± 2.29 | noise (intervals overlap) |
   
   Vectorized `Arrays.equals` fast path nets a clean 2.58× / 1.31× win at 300 / 
3 000 terms; at 30 000 terms both are statistically indistinguishable (~13 µs ± 
a few µs) — byte compare has saturated memory bandwidth.
   
   #### Trade-off
   
   Array keeps the sorted `byte[]` + offsets index live for the lifetime of the 
query, vs TIS's `PrefixCodedTerms` which compresses common prefixes. For 
small/medium term counts this is negligible; for 30k term sets the per-query 
memory difference is on the order of a few hundred KB.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Add ArrayTermInSetQuery to sandbox/ along with a JMH benchmark [lucene]

Reply via email to