Re: [I] Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation [lucene]

via GitHub Thu, 21 Mar 2024 16:15:16 -0700


msfroh commented on issue #13029:
URL: https://github.com/apache/lucene/issues/13029#issuecomment-2014023729


   I think the `else` clause for the cost estimate is also not great. 
   
   I came across this same problem where a user was essentially running a 
single-term `TermInSetQuery` (that actually matched a single doc) ANDed with a 
numeric range query that matched millions of docs. I was shocked when profiler 
output showed a lot of time spent in the `BKDReader` -- the 
`IndexOrDocValuesQuery` wrapping the range query should have clearly taken the 
doc values path.
   
   Let's say you have a string field with 10 million distinct values (so 10 
million terms), and they match 20 million documents (with individual terms 
matching 1-3 docs, say). My read is that this `estimateCost()` function will 
say that a `TermInSetQuery` over a single term has cost 10,000,001 (i.e. 20 
million `sumDocFreq` minus 10 million for the terms `size`, plus 1 for the 
query term count). 
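
   The arithmetic above can be sketched as follows. This is a minimal 
standalone sketch, not the actual Lucene code: the class name 
`CostEstimateSketch` and its plain `long` parameters are hypothetical stand-ins 
for `terms.getSumDocFreq()`, `terms.size()`, and the query term count, and the 
formula simply follows the reading described above.

```java
// Hypothetical standalone sketch; not the actual Lucene implementation.
final class CostEstimateSketch {

  // Follows the reading above: count one doc per query term, then add every
  // posting beyond the first doc of each indexed term as potential extra cost.
  static long estimateCost(long sumDocFreq, long indexedTermCount, long queryTermCount) {
    long potentialExtraCost = sumDocFreq - indexedTermCount;
    return queryTermCount + potentialExtraCost;
  }

  public static void main(String[] args) {
    // 20 million postings, 10 million indexed terms, 1 query term.
    long cost = estimateCost(20_000_000L, 10_000_000L, 1L);
    System.out.println(cost); // prints 10000001
  }
}
```

   With these inputs the estimate is dominated by the index-wide statistics; 
the single query term contributes only 1 to the total.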
   
   I get that the absolute worst case is that 9,999,999 terms each have doc 
freq 1 and the remaining term has doc freq 10,000,001, but this feels silly as 
a cost estimate for a query that is just going to rewrite to a single 
`TermQuery` with cost <= 3.

