kotman12 commented on issue #14886:
URL: https://github.com/apache/lucene/issues/14886#issuecomment-3036969030

   > (why/how) does the order matter?
   
   Order matters because [the cacheId is the original query Id plus the 
ordinal](https://github.com/apache/lucene/blob/3f71b54c47adecee6490d3c6c8f8041b062ebec6/lucene/monitor/src/java/org/apache/lucene/monitor/QueryCacheEntry.java#L56)
 of the disjunct coming from `decompose` and that is determined by `HashSet`. 
[cacheId](https://github.com/apache/lucene/blob/3f71b54c47adecee6490d3c6c8f8041b062ebec6/lucene/monitor/src/java/org/apache/lucene/monitor/WritableQueryIndex.java#L368)
 determines which disjunct(s) get(s) run in the second phase matching step. So 
if there is a mismatch between the pre-searcher derived/indexed disjunct terms 
(say derived by JVM 1) and the actual disjuncted query that backs those terms 
(say derived by JVM 2) then you can have missing matches.
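
   To make the failure mode concrete, here is a minimal sketch in plain Java, not Lucene code. The class, the method names and the `_` separator in the id are assumptions for illustration; the only thing taken from the real code is the idea that the cacheId is the query id plus the disjunct's ordinal. "JVM 1" indexes the terms of each disjunct under its cacheId, and "JVM 2" re-decomposes the same query in a different order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the two monitor phases; none of this is Lucene API.
class CacheIdMismatchSketch {

  // Assumed id scheme: parent query id plus the ordinal of the disjunct.
  static String cacheId(String queryId, int ordinal) {
    return queryId + "_" + ordinal;
  }

  // "JVM 1": record which disjunct's terms were indexed under each cacheId.
  static Map<String, String> indexTerms(String queryId, List<String> disjuncts) {
    Map<String, String> termsByCacheId = new LinkedHashMap<>();
    for (int i = 0; i < disjuncts.size(); i++) {
      termsByCacheId.put(cacheId(queryId, i), disjuncts.get(i));
    }
    return termsByCacheId;
  }

  // "JVM 2": decompose again and report every cacheId whose cached disjunct
  // is no longer the one whose terms were indexed under that id.
  static List<String> mismatchedIds(
      Map<String, String> indexed, String queryId, List<String> redecomposed) {
    List<String> mismatched = new ArrayList<>();
    for (int i = 0; i < redecomposed.size(); i++) {
      String id = cacheId(queryId, i);
      if (!redecomposed.get(i).equals(indexed.get(id))) {
        mismatched.add(id);
      }
    }
    return mismatched;
  }

  public static void main(String[] args) {
    Map<String, String> indexed = indexTerms("q1", List.of("title:foo", "body:bar"));
    // Same decomposition order: every cacheId still points at the right disjunct.
    System.out.println(mismatchedIds(indexed, "q1", List.of("title:foo", "body:bar")));
    // Reversed decomposition order: both cacheIds now point at the wrong disjunct.
    System.out.println(mismatchedIds(indexed, "q1", List.of("body:bar", "title:foo")));
  }
}
```

   In the second call the pre-searcher would select `q1_0` because of `title:foo`, but the query actually run for `q1_0` would be `body:bar`, which is how a match goes missing.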
   
   > Intuitively since QueryDecomposer seems to return Set<Query> rather than 
(say) List<Query> it seems that order would not matter.
   
   I had the reverse reaction .. because I _first_ knew how `cacheId` was 
derived, I was very surprised to eventually discover that the order from 
`decompose` was actually determined by `HashSet`. Now mind you, if you never 
shut down the process this wouldn't matter in practice because although `HashSet` 
order is not explicitly defined, it will be the same for the lifetime of the 
application ([the JVM doesn't guarantee 
this](https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html#:~:text=It%20makes%20no%20guarantees%20as%20to%20the%20iteration%20order%20of%20the%20set%3B%20in%20particular%2C%20it%20does%20not%20guarantee%20that%20the%20order%20will%20remain%20constant%20over%20time)
 but it is a well-understood implementation detail). The problem comes when you 
read a query from disk in a different process, then decompose it into a _new_ 
`HashSet` with a _new_ `BytesRef` seed and try to match those disjuncts to the 
cacheIds derived by a different JVM. This non-determinism can be confirmed by repeatedly running the 
[test](https://github.com/kotman12/lucene/commit/1424fd652898c5dab5f3a148463561724893b5a4)
 I linked. I suspect most lucene monitor users don't use the 
`WritableQueryIndex` and instead rely on the simpler, in-memory variety which 
isn't affected by this bug. This bug also fails silently which could contribute 
to why it wasn't brought up before.
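
   The effect of a per-process hash seed on `HashSet` iteration order can be sketched without Lucene. `SeededKey` below is a hypothetical key type whose `hashCode` mixes in a seed, standing in for a hash that is seeded differently in each JVM; the XOR mixing is an assumption chosen to keep the example small, not the hash Lucene uses. The concrete orders you see depend on the `HashSet` implementation (this was written with OpenJDK's `HashMap`-backed one in mind).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; the seed plays the role of a per-JVM hash seed.
class SeededHashOrderSketch {

  // Equality ignores the seed, but hashCode depends on it.
  static final class SeededKey {
    final int value;
    final int seed;

    SeededKey(int value, int seed) {
      this.value = value;
      this.seed = seed;
    }

    @Override
    public int hashCode() {
      return value ^ seed;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof SeededKey && ((SeededKey) other).value == value;
    }
  }

  // Put the values into a fresh HashSet and return them in iteration order.
  static List<Integer> iterationOrder(int seed, int... values) {
    Set<SeededKey> set = new HashSet<>();
    for (int v : values) {
      set.add(new SeededKey(v, seed));
    }
    List<Integer> order = new ArrayList<>();
    for (SeededKey key : set) {
      order.add(key.value);
    }
    return order;
  }

  public static void main(String[] args) {
    // Same logical contents, two different "process" seeds.
    System.out.println(iterationOrder(0, 1, 2, 3));
    System.out.println(iterationOrder(3, 1, 2, 3));
  }
}
```

   Both sets contain the same three elements, yet any ordinal assigned from iteration order can differ between the two seeds, which is the situation a restarted process ends up in.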
   
   > Or if order matters in some scenarios might some sort of sorter be applied 
over the sets after decomposition?
   
   I am open to this idea but since a general `Query` isn't a `Comparable` I 
don't immediately know what order to use here. The simplest thing I could think 
of is to just use the order of the `BooleanQuery::clauses` which are backed by 
a `List` whose encounter order should be stable from process to process unless 
there is a parser change and/or there is some non-determinism introduced into 
query parsing (or if the backing collection changes). Of course this is also 
relying on an implementation detail but at least one confined to the lucene 
project and one that isn't effectively guaranteed to vary wildly from process 
to process like `HashSet` will when [you give it a random hashcode 
seed](https://github.com/apache/lucene/blob/3f71b54c47adecee6490d3c6c8f8041b062ebec6/lucene/core/src/java/org/apache/lucene/util/StringHelper.java#L139).
 A more stable sort than this would presumably require some 
introspection/"fingerprinting" of the query itself, which is why I mentioned 
collecting the underlying terms and using checksum groupings as the order/identifier. If there 
is some other way to order a set of generic `Query`s I would be glad to hear it. 
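
   As a rough sketch of the fingerprinting idea, under the assumption that each disjunct can be reduced to a collection of term strings (the extraction step itself is not shown and the class and method names are hypothetical): compute a checksum over the sorted terms of each disjunct and order the disjuncts by that checksum, so the ordinal no longer depends on set iteration order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;
import java.util.zip.CRC32;

// Hypothetical sketch; disjuncts are modelled as lists of term strings.
class FingerprintOrderSketch {

  // Checksum over the disjunct's terms in sorted order, so the result does
  // not depend on the order in which the terms were collected.
  static long fingerprint(Collection<String> terms) {
    CRC32 crc = new CRC32();
    for (String term : new TreeSet<>(terms)) {
      crc.update(term.getBytes(StandardCharsets.UTF_8));
      crc.update(0); // separator so ["ab", "c"] and ["a", "bc"] differ
    }
    return crc.getValue();
  }

  // Order disjuncts by fingerprint; the input's encounter order only matters
  // for disjuncts that share a fingerprint.
  static List<List<String>> stableOrder(Collection<List<String>> disjuncts) {
    List<List<String>> ordered = new ArrayList<>(disjuncts);
    ordered.sort(Comparator.comparingLong((List<String> d) -> fingerprint(d)));
    return ordered;
  }

  public static void main(String[] args) {
    List<String> d1 = List.of("title:foo", "title:baz");
    List<String> d2 = List.of("body:bar");
    // Two different encounter orders give the same final order.
    System.out.println(stableOrder(List.of(d1, d2)));
    System.out.println(stableOrder(List.of(d2, d1)));
  }
}
```

   Disjuncts with identical fingerprints would still need a tie-break or would have to be treated as one group, which is where the "checksum groupings" part comes in.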


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

