[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

via GitHub Mon, 06 Feb 2023 06:22:16 -0800


gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1419161514

@rmuir I grabbed your patch for adding a `ScorerSupplier` to
`DocValuesTermsQuery` (#12129) and reran benchmarks. The gap between IndexOrDV
and the "self-optimizing" TermInSetQuery has closed with this change. It looks
like I was wrong about the way IndexOrDV plans PK-type queries. I thought it
was choosing to use doc values based on what I saw in profiler output, but what
I was really seeing was the up-front ordinal lookups in `DocValuesTermsQuery`
as a result of not having the `ScorerSupplier` abstraction. With your patch,
that goes away.
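
To illustrate why the supplier abstraction removes the up-front work, here is a rough plain-Java model. It is not Lucene's API: `LazySupplier`, `LazyOrdinalLookupSketch`, and the lookup counter are hypothetical names, and a binary search over a sorted list stands in for the term-to-ordinal lookup.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a supplier with a cheap cost() and a deferred get(). */
interface LazySupplier<T> {
  long cost(); // cheap estimate; must not trigger any setup work

  T get(); // expensive setup happens only here
}

final class LazyOrdinalLookupSketch {
  /** Counts simulated term-to-ordinal lookups so the deferral is observable. */
  static final AtomicInteger LOOKUPS = new AtomicInteger();

  static LazySupplier<long[]> ordinalSupplier(
      List<String> queryTerms, List<String> sortedDictionary, long maxDoc) {
    return new LazySupplier<long[]>() {
      @Override
      public long cost() {
        // Assumption: a doc-values approach may have to visit every doc.
        return maxDoc;
      }

      @Override
      public long[] get() {
        // The ordinal lookups run only if the caller actually picks this supplier.
        long[] ords = new long[queryTerms.size()];
        for (int i = 0; i < ords.length; i++) {
          LOOKUPS.incrementAndGet();
          ords[i] = Collections.binarySearch(sortedDictionary, queryTerms.get(i));
        }
        return ords;
      }
    };
  }
}
```

In this model, a planner that only calls `cost()` and then picks the other approach never pays for the lookups, which is the effect described above.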

The only gap that remains now is when the field is _not_ a PK-style field
but the terms being used in the disjunction have a low aggregate cost (relative
to the other terms in the field; e.g., `Medium Cardinality + Low Cost Country
Code Filter Terms`). In this case, IndexOrDV always uses doc values (because
its cost comes from field-level stats), but by doing some term-seeking we could
make a better decision and use postings.
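
A rough plain-Java sketch of the difference between the two cost estimates. Nothing here is Lucene code: the class, the method names, the `penalty` parameter, and the decision rule are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;

final class CostEstimateSketch {
  /** Field-level upper bound: any doc with a value for the field might match. */
  static long fieldLevelCost(long docsWithField) {
    return docsWithField;
  }

  /** Term-level estimate: sum of doc freqs of the query terms (needs term-seeking). */
  static long termLevelCost(Map<String, Integer> docFreqByTerm, List<String> queryTerms) {
    long sum = 0;
    for (String term : queryTerms) {
      sum += docFreqByTerm.getOrDefault(term, 0); // missing terms match nothing
    }
    return sum;
  }

  /**
   * Hypothetical decision rule: use postings when the estimated cost, scaled
   * down by a penalty factor that favors postings, fits within the lead cost.
   */
  static boolean usePostings(long estimatedCost, long leadCost, long penalty) {
    return estimatedCost / penalty <= leadCost;
  }
}
```

With made-up numbers (a field covering 1,000,000 docs, query terms that together match 80 docs, a lead cost of 1,000, and a penalty of 8), the field-level estimate gives 125,000 > 1,000 and picks doc values, while the term-level estimate gives 10 <= 1,000 and picks postings.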

Here are updated benchmark results:
[TiSBenchResults_Simplified_DVSSPatch.md.txt](https://github.com/apache/lucene/files/10663766/TiSBenchResults_Simplified_DVSSPatch.md.txt)
(Note that "low cardinality" cases are kind of terrible still because the
TiSQuery is being rewritten to a BooleanQuery)

> to me the issue is a problem with TermInSetQuery ScorerSupplier cost method

+1. Maybe there's a way to address this remaining gap by being smarter about
the cost function without term-seeking? That would be ideal.

I also played around with the idea of a "cost iterator" abstraction on
`ScorerSupplier` as a way to allow something like `TermInSetQuery` to provide
incremental costs to `IndexOrDocValuesQuery` as it term-seeks. This feels
clunky to me, and I'm not proposing it as a "good idea" right now, but I'll
share it as another approach. I was able to get comparable benchmark results
with this technique, and it still allows `IndexOrDocValuesQuery` to "own" the
decision between postings and doc values:
https://github.com/apache/lucene/compare/main...gsmiller:lucene:explore/tis-score-supplier-cost-iterator.
Benchmark results for this approach are here:
[TiSBenchResults_SSIterator.md.txt](https://github.com/apache/lucene/files/10663947/TiSBenchResults_SSIterator.md.txt).
It feels overly complicated though.
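
For readers who don't want to open the branch, the general shape of the idea can be modeled in plain Java roughly as follows. This is not the code from the linked branch: `CostIteratorSketch`, `runningCost`, `postingsWithinBudget`, and the budget parameter are hypothetical names chosen for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class CostIteratorSketch {
  /** Yields a running cost total, "seeking" one more term per call to next(). */
  static Iterator<Long> runningCost(
      Map<String, Integer> docFreqByTerm, List<String> queryTerms) {
    Iterator<String> terms = queryTerms.iterator();
    return new Iterator<Long>() {
      private long total = 0;

      @Override
      public boolean hasNext() {
        return terms.hasNext();
      }

      @Override
      public Long next() {
        // Simulated term seek: look up the doc freq and add it to the total.
        total += docFreqByTerm.getOrDefault(terms.next(), 0);
        return total;
      }
    };
  }

  /**
   * The caller owns the decision: it pulls incremental costs and stops
   * seeking as soon as the running total exceeds its budget.
   */
  static boolean postingsWithinBudget(Iterator<Long> costs, long budget) {
    while (costs.hasNext()) {
      if (costs.next() > budget) {
        return false; // too expensive for postings; remaining terms are never sought
      }
    }
    return true;
  }
}
```

The caller stays in charge of choosing between postings and doc values, and the term-seeking is bounded because it stops as soon as the budget is exceeded.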

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

