[I] WritableQueryIndex Relies On Hashcode Invariance Across JVMs [lucene]

via GitHub Tue, 01 Jul 2025 07:42:33 -0700


kotman12 opened a new issue, #14886:
URL: https://github.com/apache/lucene/issues/14886

### Description

The QueryDecomposer currently [orders
disjuncts](https://github.com/apache/lucene/blob/a7fc3d0ae8314ebf1236eea3076096b836d19ca8/lucene/monitor/src/java/org/apache/lucene/monitor/QueryDecomposer.java#L80)
by hashcode (via HashSet), which means WritableQueryIndex’s cache ([re-derived
at
startup](https://github.com/apache/lucene/blob/a7fc3d0ae8314ebf1236eea3076096b836d19ca8/lucene/monitor/src/java/org/apache/lucene/monitor/WritableQueryIndex.java#L164))
can end up in a different order than the one recorded in the on-disk index.
Since Lucene uses a random hash seed per JVM, a query read by a different
process than the one that wrote it can resolve to the wrong disjunct and
silently drop matches. You should be able to see this simple
[test](https://github.com/kotman12/lucene/commit/1424fd652898c5dab5f3a148463561724893b5a4)
flip between matching and not, depending on the seed.

My draft [PR](https://github.com/apache/lucene/pull/14885) switches to
insertion order. It isn’t perfect, as clause order itself may change across,
say, parsers of different Lucene versions, but it’s still a lot better than
hashcode.
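
To illustrate the difference between hash order and insertion order, here is a minimal sketch in plain Java. It does not use any Lucene classes: `Clause`, its `seed` field, and the `order` helper are hypothetical stand-ins for a query whose hashcode varies per process. The actual change is in the PR linked above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DisjunctOrderSketch {

  // Hypothetical stand-in for a query clause whose hashCode depends on a
  // per-process seed. This is NOT a Lucene class.
  static final class Clause {
    final String text;
    final int seed;

    Clause(String text, int seed) {
      this.text = text;
      this.seed = seed;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof Clause && ((Clause) o).text.equals(text);
    }

    @Override
    public int hashCode() {
      // Mix the seed in so that bucket placement changes with the seed.
      return (text.hashCode() ^ seed) * 0x9E3779B9;
    }
  }

  // Adds the clauses to a set and returns the order in which the set iterates them.
  static List<String> order(boolean insertionOrdered, int seed, String... clauses) {
    Set<Clause> set =
        insertionOrdered ? new LinkedHashSet<Clause>() : new HashSet<Clause>();
    for (String c : clauses) {
      set.add(new Clause(c, seed));
    }
    List<String> out = new ArrayList<>();
    for (Clause c : set) {
      out.add(c.text);
    }
    return out;
  }

  public static void main(String[] args) {
    for (int seed = 1; seed <= 3; seed++) {
      // HashSet order is allowed to differ between seeds; LinkedHashSet order is not.
      System.out.println("seed " + seed
          + " hash order: " + order(false, seed, "a", "b", "c", "d")
          + " insertion order: " + order(true, seed, "a", "b", "c", "d"));
    }
  }
}
```

With `LinkedHashSet` the iteration order depends only on the order in which clauses were added, so it is the same for every seed. It is still only as stable as the clause order handed to it.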

Another option would be to compute checksums over the query terms to alert
(or maybe even reindex) if this condition is detected. The unique set of terms
of a particular query will likely be the most stable feature to rely on
(although still not 100% perfect), presumably more stable than clause order and
definitely more stable than hashcode. You can even imagine taking it a step
further and grouping disjuncts into "term-equivalent disjuncts", i.e.:

`(x1 OR ... OR xn) OR (y1 OR ... OR ym) OR ...`

In this case `(x1 OR ... OR xn)` and `(y1 OR ... OR ym)` are two such
term-equivalent disjuncts, themselves potentially composed of many disjuncts,
such that each individual disjunct `x1, ... , xn` shares the same set of
_unique_ terms. I'd guess that in a large majority of cases each
term-equivalent disjunct would be a singleton and thus would have very similar
matching performance to the existing solution. The advantage is you'd have an
identifier (the checksum) instead of relying on an index into an order you
can't control. Now I imagine you also can't be 100% sure that terms of a
particular query will remain unchanged across version changes either, since
there may be some parsing optimization that prunes, say, redundant terms.
However, I imagine this is the _least_ likely to change.
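
A rough sketch of the checksum idea, under some loud assumptions: each disjunct is represented here simply as a list of term strings (in Lucene the terms would have to be collected from the sub-query first), CRC32 is an arbitrary choice of checksum, and `termChecksum` and `groupByTerms` are hypothetical helper names, not existing lucene-monitor APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.zip.CRC32;

public class TermChecksumSketch {

  // Checksum over the sorted set of unique terms, so neither clause order
  // nor repeated terms affect the result.
  static long termChecksum(List<String> terms) {
    CRC32 crc = new CRC32();
    for (String term : new TreeSet<>(terms)) {
      byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
      crc.update(bytes, 0, bytes.length);
      // Separator byte so that ["ab", "c"] and ["a", "bc"] hash differently
      // (assumes terms do not themselves contain a NUL byte).
      crc.update(0);
    }
    return crc.getValue();
  }

  // Groups disjuncts that share the same unique term set
  // ("term-equivalent disjuncts"), keyed by their checksum.
  static Map<Long, List<List<String>>> groupByTerms(List<List<String>> disjuncts) {
    Map<Long, List<List<String>>> groups = new LinkedHashMap<>();
    for (List<String> disjunct : disjuncts) {
      groups.computeIfAbsent(termChecksum(disjunct), k -> new ArrayList<>()).add(disjunct);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<List<String>> disjuncts = Arrays.asList(
        Arrays.asList("field:a", "field:b"),
        Arrays.asList("field:b", "field:a", "field:a"), // same unique terms as the first
        Arrays.asList("field:c"));
    groupByTerms(disjuncts).forEach((checksum, members) ->
        System.out.println(Long.toHexString(checksum) + " -> " + members));
  }
}
```

In this sketch the first two disjuncts land in the same group because their unique terms match, while the third forms a singleton group, which is the case I'd expect to be the most common.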

The crux of the problem is there is no guaranteed way to serialize the
sub-query into something parseable. This constraint creates room for a lot of
trappy solutions. Maybe the real solution is to just disable the
`QueryDecomposer` for `WritableQueryIndex`?

@romseygeek @cpoerschke @bjacobowitz I believe you might know the most about
this. Hopefully you don't mind the cc!

### Version and environment details

Lucene on main. As far as I can tell this bug has existed since
lucene-monitor was added.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
