msfroh commented on issue #13029: URL: https://github.com/apache/lucene/issues/13029#issuecomment-2014023729
I think the `else` clause for the cost estimate is also not great. I came across this same problem where a user was essentially running a single-term `TermInSetQuery` (which actually matched a single doc) AND a numeric range query that matched millions of docs. I was shocked when the profiler output showed a lot of time spent in `BKDReader`: the `IndexOrDocValuesQuery` wrapping the range query should clearly have taken the doc values path.

Let's say you have a string field with 10 million distinct values (so 10 million terms), and they match 20 million documents (with individual terms matching 1-3 docs, say). My read is that this `estimateCost()` function will say that a `TermInSetQuery` over a single term has cost 10,000,001 (i.e. 20 million `sumDocFreq`, minus 10 million for the terms `size`, plus 1 for the query term count).

I get that the absolute worst case is that 9,999,999 terms each have doc freq 1 and the remaining term has doc freq 10,000,001, but this feels silly as a cost estimate for a query that is just going to rewrite to a single `TermQuery` with cost <= 3.
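
To make the numbers in this comment concrete, here is a minimal sketch of the arithmetic as I read it from the description. It is not Lucene's actual `estimateCost()` code: the class and method names (`CostEstimateSketch`, `worstCaseCost`) are hypothetical, and the inputs are the made-up figures from the example.

```java
// Hypothetical sketch: reproduces the arithmetic described in the comment,
// not the actual Lucene implementation.
final class CostEstimateSketch {

    // Worst-case estimate as described: every indexed term is assumed to
    // account for at least one posting, and all remaining postings are
    // assumed to belong to the query's terms.
    static long worstCaseCost(long sumDocFreq, long indexedTermCount, long queryTermCount) {
        long potentialExtraCost = sumDocFreq - indexedTermCount;
        return queryTermCount + potentialExtraCost;
    }

    public static void main(String[] args) {
        long sumDocFreq = 20_000_000L;       // total postings in the field (example value)
        long indexedTermCount = 10_000_000L; // distinct terms in the field (example value)
        long queryTermCount = 1L;            // single-term TermInSetQuery

        long estimated = worstCaseCost(sumDocFreq, indexedTermCount, queryTermCount);

        // In the example, each term matches at most 3 docs, so the real cost
        // of the rewritten single-term query is bounded by 3.
        long realisticUpperBound = 3L;

        System.out.println("estimated cost: " + estimated);
        System.out.println("realistic upper bound: " + realisticUpperBound);
    }
}
```

With these inputs the estimate comes out to 10,000,001 against a realistic bound of 3, which would plausibly be enough to make the conjunction treat the `TermInSetQuery` as an expensive clause when it is in fact the cheapest one.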