kotman12 commented on issue #14886: URL: https://github.com/apache/lucene/issues/14886#issuecomment-3036969030
> (why/how) does the order matter?

Order matters because [the cacheId is the original query Id plus the ordinal](https://github.com/apache/lucene/blob/3f71b54c47adecee6490d3c6c8f8041b062ebec6/lucene/monitor/src/java/org/apache/lucene/monitor/QueryCacheEntry.java#L56) of the disjunct coming from `decompose`, and that ordinal is determined by `HashSet` iteration order. The [cacheId](https://github.com/apache/lucene/blob/3f71b54c47adecee6490d3c6c8f8041b062ebec6/lucene/monitor/src/java/org/apache/lucene/monitor/WritableQueryIndex.java#L368) determines which disjunct(s) get run in the second-phase matching step. So if there is a mismatch between the pre-searcher derived/indexed disjunct terms (say derived by JVM 1) and the actual disjuncted query that backs those terms (say derived by JVM 2), then you can have missing matches.

> Intuitively since QueryDecomposer seems to return Set<Query> rather than (say) List<Query> it seems that order would not matter.

I had the reverse reaction .. because I _first_ knew how `cacheId` was derived, I was very surprised to eventually discover that the order from `decompose` was actually determined by `HashSet`. Now mind you, if you never shut down the process this wouldn't matter in practice, because although `HashSet` order is not explicitly defined, it will be the same for the lifetime of the application ([the JVM doesn't guarantee this](https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html#:~:text=It%20makes%20no%20guarantees%20as%20to%20the%20iteration%20order%20of%20the%20set%3B%20in%20particular%2C%20it%20does%20not%20guarantee%20that%20the%20order%20will%20remain%20constant%20over%20time) but it is a well-understood implementation detail). The problem comes when you read a query from disk in a different process, then decompose it into a _new_ `HashSet` with a _new_ `BytesRef` seed and try to match those disjuncts to the cacheIds derived by a different JVM.
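To make the failure mode concrete, here is a minimal sketch that does not use Lucene at all. Everything in it is hypothetical: `SeededDisjunct` stands in for a disjunct whose `hashCode` mixes in a per-process seed, the two seeds simulate two JVMs, and the `queryId + "_" + ordinal` id format is an assumption made for illustration. It only shows that ordinals taken from `HashSet` iteration can map the same id to different disjuncts, while ordinals taken from a `List` cannot.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CacheIdOrderSketch {

  /** Hypothetical disjunct whose hash depends on a per-process seed. */
  static final class SeededDisjunct {
    final String text;
    final int seed;

    SeededDisjunct(String text, int seed) {
      this.text = text;
      this.seed = seed;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof SeededDisjunct)) return false;
      SeededDisjunct other = (SeededDisjunct) o;
      return seed == other.seed && text.equals(other.text);
    }

    @Override
    public int hashCode() {
      // The seed shifts which bucket each element lands in.
      return text.hashCode() ^ seed;
    }
  }

  /** Assigns ids of the assumed form queryId_ordinal, following iteration order. */
  static Map<String, String> assignCacheIds(String queryId, Collection<SeededDisjunct> disjuncts) {
    Map<String, String> idToDisjunct = new LinkedHashMap<>();
    int ordinal = 0;
    for (SeededDisjunct d : disjuncts) {
      idToDisjunct.put(queryId + "_" + ordinal, d.text);
      ordinal++;
    }
    return idToDisjunct;
  }

  /** Builds the disjuncts in clause order, as one simulated process would. */
  static List<SeededDisjunct> clauses(int seed, String... terms) {
    List<SeededDisjunct> out = new ArrayList<>();
    for (String t : terms) {
      out.add(new SeededDisjunct(t, seed));
    }
    return out;
  }

  /** Ordinals come from HashSet iteration, so they depend on the seed. */
  static Map<String, String> viaHashSet(String queryId, int seed, String... terms) {
    return assignCacheIds(queryId, new HashSet<>(clauses(seed, terms)));
  }

  /** Ordinals come from List encounter order, so the seed is irrelevant. */
  static Map<String, String> viaList(String queryId, int seed, String... terms) {
    return assignCacheIds(queryId, clauses(seed, terms));
  }

  public static void main(String[] args) {
    String[] terms = {"a", "b", "c"};
    System.out.println("HashSet, seed 0: " + viaHashSet("q1", 0, terms));
    System.out.println("HashSet, seed 3: " + viaHashSet("q1", 3, terms));
    System.out.println("List,    seed 0: " + viaList("q1", 0, terms));
    System.out.println("List,    seed 3: " + viaList("q1", 3, terms));
  }
}
```

If one simulated process writes the id-to-terms mapping and the other rebuilds the disjuncts, an id such as `q1_0` can end up pointing at a different disjunct than the one whose terms were indexed, which is the silent mismatch described above.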
This non-determinism can be confirmed by repeatedly running the [test](https://github.com/kotman12/lucene/commit/1424fd652898c5dab5f3a148463561724893b5a4) I linked. I suspect most Lucene monitor users don't use the `WritableQueryIndex` and instead rely on the simpler, in-memory variety, which isn't affected by this bug. The bug also fails silently, which could contribute to why it wasn't brought up before.

> Or if order matters in some scenarios might some sort of sorter be applied over the sets after decomposition?

I am open to this idea, but since a general `Query` isn't a `Comparable` I don't immediately know what order to use here. The simplest thing I could think of is to just use the order of the `BooleanQuery::clauses`, which are backed by a `List` whose encounter order should be stable from process to process unless there is a parser change and/or some non-determinism is introduced into query parsing (or the backing collection changes). Of course this also relies on an implementation detail, but at least one confined to the Lucene project and one that isn't effectively guaranteed to vary wildly from process to process like `HashSet` will when [you give it a random hashcode seed](https://github.com/apache/lucene/blob/3f71b54c47adecee6490d3c6c8f8041b062ebec6/lucene/core/src/java/org/apache/lucene/util/StringHelper.java#L139). A more stable sort than this would presumably require some introspection/"fingerprinting" of the query itself, which is why I mentioned collecting the underlying terms and using checksum groupings as the order/identifier. If there is some other way to order a set of generic `Query`s I would be glad to hear it.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org