GovindBalaji-S-Glean opened a new pull request, #16051: URL: https://github.com/apache/lucene/pull/16051
### Description PR for Lucene dev@ thread: https://lists.apache.org/thread/ct04woc11hh9vclhscz8pkdozv6xoy6k Adds an alternative `TermInSetQuery` implementation in the sandbox module that stores terms as a sorted `BytesRef[]` wrapping a packed `byte[]`, rather than `PrefixCodedTerms`. Trades some RAM for cheaper per-segment iteration and a vectorized `equals`/`hashCode` fast path on the cache hot path. Includes a JMH benchmark in `benchmark-jmh`. #### Build / test - `./gradlew :lucene:sandbox:test --tests "*ArrayTermInSet*"` → 17 tests pass on JDK 25. - `./gradlew :lucene:benchmark-jmh:assemble` → green. - `./gradlew tidy` → no formatting changes outside the 4 files in this PR. - `./gradlew :lucene:sandbox:check :lucene:benchmark-jmh:check` → green on JDK 25 (Temurin 25.0.3). #### Benchmark Full matrix on a 32-core / 125 GB GCP n2 VM, JDK 25 (Temurin 25.0.3), JMH 1.37, single fork, 3 warmup × 5 measurement × 5 s. ### Benchmark parameters | Param | Values | What it varies | Used by | |---|---|---|---| | `numTerms` | `30`, `300`, `3000`, `30000` | Number of terms in the query — from a tiny ACL filter to a large enumerated set. | all benchmarks | | `numSegments` | `5`, `20`, `50` | Number of segments in the index. Each segment pays the per-term seek cost, so this scales the iteration phase. | `construct+iterate*` only | | `indexContent` | `QUERY_ONLY` · `SPARSE` · `RANDOM_50K` | What lives in each segment's terms dictionary (see below). | `construct+iterate*` only | | `inputShape` | `UNSORTED_LIST` · `SORTED_SET` | How the query terms are passed to the constructor (see below). | `construct*` and `construct+iterate*` | #### `indexContent` - **`QUERY_ONLY`** — every query term is indexed in every segment. All seeks hit. - **`SPARSE`** — only 2 deterministic terms (first + middle of the query set) per segment, so most seeks miss. The shape that hurts `TermInSetQuery` the most because `PrefixCodedTerms` decodes per term regardless of hit/miss. - **`RANDOM_50K`** — 50 000 random terms per segment, independent of the query. Zero hits, but a large dictionary to navigate. Stresses the seek path on a deep terms tree. #### `inputShape` - **`UNSORTED_LIST`** — `Arrays.asList(shuffled)`. Both queries radix-sort internally, so this measures sort + storage cost. - **`SORTED_SET`** — `TreeSet<BytesRef>` (natural-order comparator). Both queries hit their skip-sort fast path, so this isolates storage-shape cost from sort cost. #### Benchmark methods | Method | Measures | |---|---| | `construct{TermInSet,ArrayTermInSet}Query` | Constructor cost only. | | `constructAndIterate{TermInSet,ArrayTermInSet}Query` | End-to-end query execution: ctor + per-segment terms iteration. | | `equals{TermInSetQuery, ArrayTermInSetQuery, FlatPacked, PackedPlusLengths}` | Cache-hit equality on equal queries — the `LRUQueryCache` hot path. | Numbers are µs/op average. ##### `construct` only — query ctor cost | numTerms | shape | Array µs | TIS µs | TIS / Array | |---:|---|---:|---:|---:| | 30 | SORTED_SET | 0.86 | 2.28 | **2.6×** | | 30 | UNSORTED_LIST | 2.87 | 4.06 | 1.4× | | 300 | SORTED_SET | 11.1 | 22.0 | **2.0×** | | 300 | UNSORTED_LIST | 32.8 | 43.5 | 1.3× | | 3 000 | SORTED_SET | 108 | 215 | **2.0×** | | 3 000 | UNSORTED_LIST | 361 | 477 | 1.3× | | 30 000 | SORTED_SET | 1 106 | 2 182 | **2.0×** | | 30 000 | UNSORTED_LIST | 5 400 | 5 952 | 1.1× | Skip-sort fast path nets a clean ~2× across all sizes; on `UNSORTED_LIST` the gap collapses at 30k terms because the radix sort dominates and both queries pay it. ##### `construct + per-segment iterate` — the path real queries take, slice at `numSegments=20` Full benchmark output: https://gist.github.com/GovindBalaji-S-Glean/fe4af91bead4dcd2390fa4da381b09e7 | numTerms | indexContent | shape | Array µs | TIS µs | TIS / Array | |---:|---|---|---:|---:|---:| | 300 | QUERY_ONLY | SORTED_SET | 902 | 1 049 | 1.16× | | 300 | RANDOM_50K | SORTED_SET | 2 293 | 2 687 | 1.17× | | 300 | **SPARSE** | SORTED_SET | **108** | **193** | **1.79×** | | 3 000 | QUERY_ONLY | SORTED_SET | 10 071 | 11 229 | 1.11× | | 3 000 | **SPARSE** | SORTED_SET | **356** | **1 212** | **3.41×** | | 30 000 | QUERY_ONLY | SORTED_SET | 87 202 | 118 585 | 1.36× | | 30 000 | RANDOM_50K | SORTED_SET | 60 005 | 76 420 | 1.27× | | 30 000 | **SPARSE** | SORTED_SET | **2 819** | **11 593** | **4.11×** | Pattern: - **Dense** indexes (`QUERY_ONLY`, `RANDOM_50K`): Array wins consistently, **1.1–1.4×**. Iteration cost dominates; storage shape matters less. - **Sparse** indexes (`SPARSE` — 2 indexed terms vs a 30/300/3k/30k-term query): Array wins **1.5–4.5×** and the gap widens with `numTerms × numSegments`. This is exactly the case where TIS's `PrefixCodedTerms` decode runs once per query term per segment while Array just walks a sorted `BytesRef[]`. ##### `equals` (cache-hit equality on equal queries) | numTerms | Array µs | TIS µs | TIS / Array | |---:|---:|---:|---:| | 300 | 0.091 ± 0.001 | 0.235 ± 0.001 | **2.58×** | | 3 000 | 1.030 ± 0.005 | 1.354 ± 0.008 | 1.31× | | 30 000 | 13.92 ± 1.97 | 12.86 ± 2.29 | noise (intervals overlap) | Vectorized `Arrays.equals` fast path nets a clean 2.58× / 1.31× win at 300 / 3 000 terms; at 30 000 terms both are statistically indistinguishable (~13 µs ± a few µs) — byte compare has saturated memory bandwidth. #### Trade-off Array keeps the sorted `byte[]` + offsets index live for the lifetime of the query, vs TIS's `PrefixCodedTerms` which compresses common prefixes. For small/medium term counts this is negligible; for 30k term sets the per-query memory difference is on the order of a few hundred KB. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
