[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436183#comment-17436183
 ] 

Greg Miller commented on LUCENE-10207:
--------------------------------------

I spent a little more time on this today and would be interested in any 
feedback on getting the "term count" for these MultiTermQueries for the purpose 
of estimating cost. For TermInSetQuery, this is known up-front since the user 
provides the set of terms. But MultiTermQueries in general often don't "know" 
their term count up-front (until producing their TermsEnum, which itself seems 
costly).

I'm considering adding a new method to MultiTermQuery that allows 
implementations to provide their term count if known, or -1 if not known. Cost 
could then be estimated like this:

{code:java}
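      Terms indexTerms = context.reader().terms(query.getField());

      // Proposed new method: number of query terms, or -1 if not known up-front
      int queryTermsCount = query.getTermsCount();
      if (indexTerms == null) {
        cost = 0;  // field doesn't exist
      } else if (queryTermsCount == -1) {
        // unknown term count: fall back to every doc that has the field
        cost = indexTerms.getDocCount();
      } else {
        // assume every query term is in the index with at least one posting, and
        // that, worst case, the query terms also hold all postings beyond one
        // per indexed term; never exceed the number of docs with the field
        cost = Math.min(indexTerms.getDocCount(),
            queryTermsCount + (indexTerms.getSumDocFreq() - indexTerms.size()));
      }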
{code}

Does this seem like a reasonable approach? Any other ideas?
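
To make the arithmetic concrete, here is a minimal standalone sketch of the 
same estimate using plain numbers in place of the Lucene Terms statistics. The 
class name, method name, and parameters are hypothetical and only stand in for 
getDocCount(), getSumDocFreq(), size(), and the proposed getTermsCount(); it is 
not the patch itself.

```java
public final class CostEstimateSketch {
    // Hypothetical sentinel mirroring the "-1 if not known" convention above.
    static final int UNKNOWN_TERM_COUNT = -1;

    /**
     * Upper-bound cost estimate for a multi-term query on one field.
     *
     * @param fieldExists     stands in for "indexTerms != null"
     * @param docCount        docs with at least one term in the field
     * @param sumDocFreq      total postings in the field
     * @param indexTermCount  unique terms in the field
     * @param queryTermsCount number of query terms, or -1 if unknown
     */
    static long estimateCost(boolean fieldExists, long docCount, long sumDocFreq,
                             long indexTermCount, int queryTermsCount) {
        if (!fieldExists) {
            return 0; // field doesn't exist, nothing can match
        }
        if (queryTermsCount == UNKNOWN_TERM_COUNT) {
            return docCount; // no better bound than every doc with the field
        }
        // Every indexed term has at least one posting, so this is the number of
        // postings left over after giving each indexed term exactly one.
        long excessPostings = sumDocFreq - indexTermCount;
        // Worst case: the query terms own all of the excess postings.
        return Math.min(docCount, queryTermsCount + excessPostings);
    }

    public static void main(String[] args) {
        // Primary-key-like field (sumDocFreq == term count): cost is the term count.
        System.out.println(estimateCost(true, 1000, 1000, 1000, 5));
        // Mostly-unique field: 100 excess postings are charged to the query terms.
        System.out.println(estimateCost(true, 1000, 1000, 900, 5));
        // Unknown term count: falls back to the field's doc count.
        System.out.println(estimateCost(true, 1000, 1000, 900, UNKNOWN_TERM_COUNT));
    }
}
```

For a primary-key-like field the excess is zero, so the estimate collapses to 
the number of query terms. For a field with few distinct terms the excess 
dominates and the estimate approaches the field's doc count.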

The other issue here is that we don't know at this point how many of the query 
terms are actually in the index. So this could over-estimate cost if there is 
a huge set of query terms that aren't in the index. But solving for that 
requires intersecting the indexed terms with the query terms, which adds 
up-front cost.
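
As a rough illustration of that trade-off, the sketch below uses a 
java.util.TreeSet as a stand-in for the terms dictionary (an assumption for 
illustration only; the real lookup would go through a TermsEnum) and compares 
the cheap estimate against the exact count obtained by looking up every query 
term. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public final class OverEstimateSketch {
    /** Exact count of query terms present in the index: one lookup per query term. */
    static long countPresent(TreeSet<String> indexedTerms, List<String> queryTerms) {
        long present = 0;
        for (String term : queryTerms) {
            if (indexedTerms.contains(term)) {
                present++;
            }
        }
        return present;
    }

    /** Cheap estimate from field-level statistics only, with no per-term lookups. */
    static long cheapEstimate(long docCount, long sumDocFreq, long indexTermCount,
                              int queryTermsCount) {
        return Math.min(docCount, queryTermsCount + (sumDocFreq - indexTermCount));
    }

    public static void main(String[] args) {
        // Primary-key-like field: 1000 docs, each with one unique id.
        TreeSet<String> indexedTerms = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            indexedTerms.add(String.format("id%04d", i));
        }

        // Query with 5000 terms, only 10 of which are in the index.
        List<String> queryTerms = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            queryTerms.add(String.format("id%04d", i));
        }
        for (int i = 0; i < 4990; i++) {
            queryTerms.add("missing" + i);
        }

        long estimate = cheapEstimate(1000, 1000, indexedTerms.size(), queryTerms.size());
        long exact = countPresent(indexedTerms, queryTerms);
        System.out.println("cheap estimate: " + estimate);
        System.out.println("exact matches:  " + exact);
    }
}
```

In this made-up case the cheap estimate saturates at the field's doc count 
while only 10 docs can actually match, but getting the exact figure took 5000 
dictionary lookups before any matching started.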

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -----------------------------------------------------
>
>                 Key: LUCENE-10207
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10207
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-10207_multitermquery.patch
>
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a 
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery. 
> However IndexOrDocValuesQuery only works well if the "index" query can give 
> an estimation of the cost of the query without doing anything expensive (like 
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could 
> implement it for primary keys (terms.size() == sumDocFreq) by returning the 
> number of terms of the query? Another idea is to multiply the number of terms 
> by the average postings length, though this could be dangerous if the field 
> has a zipfian distribution and some terms have a much higher doc frequency 
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently 
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is 
> interest in this. Here is an email thread where this was recently discussed: 
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
