kotman12 opened a new issue, #14886: URL: https://github.com/apache/lucene/issues/14886
### Description

The QueryDecomposer currently [orders disjuncts](https://github.com/apache/lucene/blob/a7fc3d0ae8314ebf1236eea3076096b836d19ca8/lucene/monitor/src/java/org/apache/lucene/monitor/QueryDecomposer.java#L80) by hashcode (via HashSet), which means WritableQueryIndex's cache ([re-derived at startup](https://github.com/apache/lucene/blob/a7fc3d0ae8314ebf1236eea3076096b836d19ca8/lucene/monitor/src/java/org/apache/lucene/monitor/WritableQueryIndex.java#L164)) can end up in a different order than the on-disk index. Because Lucene uses a random hash seed per JVM, a query read by a different process than the one that wrote it can behave unpredictably and silently drop matches. You should be able to see this simple [test](https://github.com/kotman12/lucene/commit/1424fd652898c5dab5f3a148463561724893b5a4) flip between matching and not matching, depending on the seed.

My draft [PR](https://github.com/apache/lucene/pull/14885) switches to insertion order. It isn't perfect, since clause order itself may change across, say, parsers of different Lucene versions, but it's still a lot better than hashcode.

Another option would be to compute checksums over the query terms and alert (or maybe even reindex) if a mismatch is detected. The unique set of terms of a particular query will likely be the most stable feature to rely on (although still not 100% perfect), presumably more stable than clause order and definitely more stable than hashcode.

You can even imagine taking it a step further and grouping disjuncts into "term-equivalent disjuncts", i.e.:

`(x1 OR ... OR xn) OR (y1 OR ... OR ym) OR ...`

In this case `(x1 OR ... OR xn)` and `(y1 OR ... OR ym)` are two such term-equivalent disjuncts, each potentially composed of many disjuncts, such that the individual disjuncts `x1, ..., xn` all share the same set of _unique_ terms.
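
To make the term-equivalent grouping concrete, here is a minimal sketch in plain Java, not based on any existing lucene-monitor API. The class `TermSetKeys` and its methods are hypothetical, each disjunct is modelled simply as a list of `field:text` strings, and CRC32 stands in for whatever checksum would actually be chosen. The point is only that the key is computed from the sorted, de-duplicated terms, so it depends neither on clause order nor on a per-JVM hash seed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.zip.CRC32;

/** Hypothetical helper for illustration only; not part of lucene-monitor. */
final class TermSetKeys {

  private TermSetKeys() {}

  /**
   * Checksum over the unique terms of one disjunct. Terms are sorted and
   * de-duplicated first, so clause order and repeated terms do not affect
   * the result.
   */
  static long checksum(Collection<String> terms) {
    CRC32 crc = new CRC32();
    for (String term : new TreeSet<>(terms)) {
      byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
      crc.update(bytes, 0, bytes.length);
      crc.update(0); // separator byte, so term boundaries are part of the input
    }
    return crc.getValue();
  }

  /**
   * Groups disjuncts by their term-set checksum. Each map entry is one
   * "term-equivalent disjunct"; in the common case it holds a single element.
   */
  static Map<Long, List<List<String>>> group(List<List<String>> disjuncts) {
    Map<Long, List<List<String>>> groups = new LinkedHashMap<>();
    for (List<String> disjunct : disjuncts) {
      groups.computeIfAbsent(checksum(disjunct), k -> new ArrayList<>()).add(disjunct);
    }
    return groups;
  }
}
```

With this shape, a cache entry would be looked up by its checksum rather than by its position in an iteration order, and a checksum that is missing at startup would be the signal to alert or reindex.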
I'd guess that in the large majority of cases each term-equivalent disjunct would be a singleton, so matching performance would be very similar to the existing solution. The advantage is that you'd have an identifier (the checksum) instead of relying on an index into an order you can't control.

Of course, you also can't be 100% sure that the terms of a particular query will remain unchanged across versions, since some parsing optimization may prune, say, redundant terms. However, I imagine this is the _least_ likely thing to change.

The crux of the problem is that there is no guaranteed way to serialize a sub-query into something parseable. This constraint creates room for a lot of trappy solutions. Maybe the real solution is to just disable the `QueryDecomposer` for `WritableQueryIndex`?

@romseygeek @cpoerschke @bjacobowitz I believe you know this code best; hopefully you don't mind the cc!

### Version and environment details

Lucene on main. As far as I can tell this bug has existed since lucene-monitor was added.