[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

Greg Miller (Jira) Thu, 04 Aug 2022 11:49:08 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575402#comment-17575402
 ]


Greg Miller commented on LUCENE-10207:
--------------------------------------

I'm coming back to this work now because I'm working on another project that 
would benefit from the ability to use a {{TermInSetQuery}} within an 
{{IndexOrDocValuesQuery}}. This work stalled last year on the question of 
whether making {{TermInSetQuery}} extend {{MultiTermQuery}} would have a 
negative performance impact, since the term intersection implementation would 
differ. The motivation for extending {{MultiTermQuery}} was to make a doc 
values-based term-in-set implementation easy (using the existing 
{{DocValuesRewriteMethod}}).

I suggest we separate some of these concerns to make progress. The sandbox 
module already has {{DocValuesTermsQuery}}, which could be paired with 
{{TermInSetQuery}} inside an {{IndexOrDocValuesQuery}}. But we still can't use 
{{TermInSetQuery}} in an {{IndexOrDocValuesQuery}}, since {{TermInSetQuery}} 
doesn't provide a {{ScorerSupplier}} with cost estimation. I propose we address 
this first, and not worry about refactoring {{TermInSetQuery}} to extend 
{{MultiTermQuery}} at this point. This would be incremental progress that 
enables using {{TermInSetQuery}} + {{DocValuesTermsQuery}} in an 
{{IndexOrDocValuesQuery}}, without requiring us to settle the performance 
impact of changing {{TermInSetQuery}} to extend {{MultiTermQuery}}.
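
To illustrate why the "index" side needs a cheap cost estimate, here is a 
minimal, self-contained model of the kind of choice {{IndexOrDocValuesQuery}} 
makes. It is not Lucene code: the class, the method names, and the 8x penalty 
applied to the doc-values side are assumptions chosen for illustration only.

```java
/**
 * Hypothetical model of choosing between an index-based and a doc-values-based
 * execution of the same query. Not part of Lucene; all names are illustrative.
 */
public final class ExecutionModeSketch {

  public enum Mode { INDEX, DOC_VALUES }

  /** Assumed penalty for doc values: dividing by 2^3 = 8 (an illustrative choice). */
  private static final int DOC_VALUES_PENALTY_SHIFT = 3;

  private ExecutionModeSketch() {}

  /**
   * @param indexCost estimated number of docs the index query would visit
   * @param leadCost  estimated number of docs the rest of the query tree will
   *                  ask this clause to verify
   */
  public static Mode choose(long indexCost, long leadCost) {
    long threshold = indexCost >>> DOC_VALUES_PENALTY_SHIFT;
    // If the index query is not much more expensive than the lead iterator,
    // run it against the index; otherwise verify lead docs via doc values.
    return threshold <= leadCost ? Mode.INDEX : Mode.DOC_VALUES;
  }

  public static void main(String[] args) {
    System.out.println(choose(1_000, 500));
    System.out.println(choose(1_000_000, 100));
  }
}
```

The point of the sketch is that {{choose}} can only work if {{indexCost}} is 
available before any postings are read, which is the piece {{TermInSetQuery}} 
is missing today.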

I've opened a separate PR to make this iterative step: 
https://github.com/apache/lucene/pull/1058

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -----------------------------------------------------
>
>                 Key: LUCENE-10207
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10207
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Greg Miller
>            Priority: Minor
>         Attachments: LUCENE-10207_multitermquery.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a 
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery. 
> However, IndexOrDocValuesQuery only works well if the "index" query can give 
> an estimation of the cost of the query without doing anything expensive (like 
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could 
> implement it for primary keys (terms.size() == sumDocFreq) by returning the 
> number of terms of the query? Another idea is to multiply the number of terms 
> by the average postings length, though this could be dangerous if the field 
> has a Zipfian distribution and some terms have a much higher doc frequency 
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently 
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is 
> interest in this. Here is an email thread where this was recently discussed: 
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.
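
The two cost heuristics floated in the description above can be sketched as 
follows. This is a hedged illustration, not the implementation in the PR: the 
class and method names are hypothetical, the inputs are plain numbers standing 
in for field-level statistics, and capping the result at {{sumDocFreq}} is an 
extra assumption that the description does not mention.

```java
/**
 * Hypothetical sketch of estimating the cost of a term-in-set query from
 * field-level statistics only, without looking up any term. Not Lucene code.
 */
public final class TermInSetCostSketch {

  private TermInSetCostSketch() {}

  /**
   * @param queryTermCount number of terms in the query
   * @param fieldTermCount number of unique terms in the field (terms.size())
   * @param sumDocFreq     total number of postings across all terms of the field
   * @return an estimate of the number of postings the query would visit
   */
  public static long estimateCost(long queryTermCount, long fieldTermCount, long sumDocFreq) {
    if (queryTermCount <= 0 || fieldTermCount <= 0 || sumDocFreq <= 0) {
      return 0;
    }
    long estimate;
    if (fieldTermCount == sumDocFreq) {
      // Primary-key field: every term matches exactly one document.
      estimate = queryTermCount;
    } else {
      // Average postings length, rounded up so the estimate is not too low.
      long avgPostingsLength = (sumDocFreq + fieldTermCount - 1) / fieldTermCount;
      estimate = queryTermCount * avgPostingsLength;
    }
    // Assumption: a query can never visit more postings than the field holds.
    return Math.min(estimate, sumDocFreq);
  }

  public static void main(String[] args) {
    System.out.println(estimateCost(3, 1_000, 1_000));
    System.out.println(estimateCost(10, 100, 550));
  }
}
```

The Zipfian concern shows up directly in this model: for a field with 100 
terms and 10,000 postings, a single-term query is estimated at 100, yet if 
that one term happens to own 9,000 of the postings the estimate is off by 
almost two orders of magnitude.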



