kotman12 opened a new issue, #14886:
URL: https://github.com/apache/lucene/issues/14886

   ### Description
   
   The QueryDecomposer currently [orders 
disjuncts](https://github.com/apache/lucene/blob/a7fc3d0ae8314ebf1236eea3076096b836d19ca8/lucene/monitor/src/java/org/apache/lucene/monitor/QueryDecomposer.java#L80)
 by hashcode (via HashSet), which means WritableQueryIndex’s cache ([re-derived 
at 
startup](https://github.com/apache/lucene/blob/a7fc3d0ae8314ebf1236eea3076096b836d19ca8/lucene/monitor/src/java/org/apache/lucene/monitor/WritableQueryIndex.java#L164))
 can end up in a different order than the on-disk index. Since Lucene uses a 
random hash seed per JVM, a query read by a different process than the one that 
wrote it behaves unpredictably and can silently drop matches. You should be 
able to see this simple 
[test](https://github.com/kotman12/lucene/commit/1424fd652898c5dab5f3a148463561724893b5a4)
 flip between matching and not, depending on the seed.
   
   My draft [PR](https://github.com/apache/lucene/pull/14885) switches to 
insertion order. It isn’t perfect, as clause order itself may change across, 
say, parsers of different Lucene versions, but it’s still a lot better than 
hashcode. 
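
To illustrate the difference, here is a minimal sketch in plain Java. It is not Lucene code: `SeededClause` is a hypothetical stand-in for a query whose `hashCode` mixes in a per-process seed, and `order` is a hypothetical helper. `LinkedHashSet` is documented to iterate in insertion order, while `HashSet` iteration order follows hash bucket placement and so may change with the seed.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;

final class DisjunctOrderSketch {

  /** Hypothetical stand-in for a query whose hashCode mixes in a per-JVM seed. */
  static final class SeededClause {
    final String text;
    final int seed;

    SeededClause(String text, int seed) {
      this.text = text;
      this.seed = seed;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof SeededClause
          && ((SeededClause) o).text.equals(text)
          && ((SeededClause) o).seed == seed;
    }

    @Override
    public int hashCode() {
      return text.hashCode() ^ seed; // the seed influences bucket placement
    }
  }

  /** Adds the clauses to the given empty collection and returns their texts in iteration order. */
  static List<String> order(Collection<SeededClause> target, int seed, String... texts) {
    for (String t : texts) {
      target.add(new SeededClause(t, seed));
    }
    List<String> out = new ArrayList<>();
    for (SeededClause c : target) {
      out.add(c.text);
    }
    return out;
  }

  public static void main(String[] args) {
    // Two seeds simulate two JVMs: one that wrote the index, one that reads it.
    for (int seed : new int[] {0, 3}) {
      System.out.println("seed=" + seed
          + " HashSet=" + order(new HashSet<>(), seed, "a", "b", "c")
          + " LinkedHashSet=" + order(new LinkedHashSet<>(), seed, "a", "b", "c"));
    }
  }
}
```

The `LinkedHashSet` column stays `[a, b, c]` for every seed. The `HashSet` column is whatever the hash buckets dictate, so a position recorded by one process is not a safe identifier in another.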
   
   Another option would be to compute checksums over the query terms to alert 
(or maybe even reindex) if this condition is detected. The unique set of terms 
of a particular query will likely be the most stable feature to rely on 
(although still not 100% perfect), presumably more stable than clause order and 
definitely more stable than hashcode. You can even imagine taking it a step 
further and grouping disjuncts into "term-equivalent disjuncts", i.e.:
   
   `(x1 OR ... OR xn) OR (y1 OR ... OR ym) OR ...`
   
   In this case, `(x1 OR ... OR xn)` and `(y1 OR ... OR ym)` are two such 
term-equivalent disjuncts, themselves potentially composed of many disjuncts, 
such that each individual disjunct `x1, ..., xn` shares the same set of 
_unique_ terms. I'd guess that in a large majority of cases each 
term-equivalent disjunct would be a singleton and thus would have very similar 
matching performance to the existing solution. The advantage is you'd have an 
identifier (the checksum) instead of relying on an index into an order you 
can't control. Now I imagine you also can't be 100% sure that terms of a 
particular query will remain unchanged across version changes either, since 
there may be some parsing optimization that prunes, say, redundant terms. 
However, I imagine this is the _least_ likely to change.
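
A rough sketch of what such a term checksum could look like, in plain Java. Everything here is an assumption for illustration: `checksum` is a hypothetical helper, terms are modeled as `field:text` strings rather than Lucene `Term` objects, and CRC32 is just one possible choice of checksum. The terms are sorted and de-duplicated first, so the result depends only on the unique set of terms, not on clause order or hash codes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;
import java.util.zip.CRC32;

final class TermChecksumSketch {

  /** Order- and duplicate-insensitive checksum over a disjunct's terms (hypothetical helper). */
  static long checksum(Collection<String> terms) {
    CRC32 crc = new CRC32();
    // TreeSet sorts and de-duplicates, so only the unique set of terms matters.
    for (String term : new TreeSet<>(terms)) {
      byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
      crc.update(bytes, 0, bytes.length);
      // Separator byte so that ["ab", "c"] and ["a", "bc"] hash differently.
      // Assumes terms never contain a zero byte themselves.
      crc.update(0);
    }
    return crc.getValue();
  }

  public static void main(String[] args) {
    long x1 = checksum(Arrays.asList("field:b", "field:a", "field:a"));
    long x2 = checksum(Arrays.asList("field:a", "field:b"));
    long y1 = checksum(Arrays.asList("field:a", "field:c"));
    // x1 and x2 share the same unique terms, so they are term-equivalent.
    System.out.println("x1 == x2: " + (x1 == x2));
    System.out.println("x1 == y1: " + (x1 == y1));
  }
}
```

Disjuncts with equal checksums would fall into the same term-equivalent group, and the checksum could be stored as the identifier in place of a positional index.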
   
   The crux of the problem is there is no guaranteed way to serialize the 
sub-query into something parseable. This constraint creates room for a lot of 
trappy solutions. Maybe the real solution is to just disable the 
`QueryDecomposer` for `WritableQueryIndex`?
   
   @romseygeek @cpoerschke @bjacobowitz I believe you might know the most 
about this; hopefully you don't mind the cc!
   
   ### Version and environment details
   
   Lucene on main. As far as I can tell this bug has existed since 
lucene-monitor was added.


