iprithv commented on code in PR #16240:
URL: https://github.com/apache/lucene/pull/16240#discussion_r3426627958
##########
lucene/core/src/java/org/apache/lucene/index/FilteredTermsEnum.java:
##########
@@ -228,19 +249,20 @@ public BytesRef next() throws IOException {
if (doSeek) {
doSeek = false;
final BytesRef t = nextSeekTerm(actualTerm);
- // System.out.println(" seek to t=" + (t == null ? "null" :
t.utf8ToString()) + " tenum=" +
- // tenum);
// Make sure we always seek forward:
assert actualTerm == null || t == null || t.compareTo(actualTerm) > 0
: "curTerm=" + actualTerm + " seekTerm=" + t;
- if (t == null || tenum.seekCeil(t) == SeekStatus.END) {
- // no more terms to seek to or enum exhausted
- // System.out.println(" return null");
+ if (t == null) {
+ return null;
+ }
+ if (--visitsBudget < 0 || tenum.seekCeil(t) == SeekStatus.END) {
Review Comment:
budget gets decremented before checking seekCeil/next, so if the enum
naturally runs out of terms and visitsBudget happens to land on 0 at the same
time, isVisitsBudgetExhausted() returns true even though you got all the
terms...caller then throws away complete results and falls back to deferred
collection for no reason
can we have separate boolean budgetExhausted flag that only gets set when
the budget check actually fires?
##########
lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java:
##########
@@ -228,62 +237,66 @@ public ScorerSupplier scorerSupplier(LeafReaderContext
context) throws IOExcepti
final long cost;
final IOLongFunction<WeightOrDocIdSetIterator> weightOrIteratorSupplier;
- // Only collect terms while building the ScorerSupplier when the query
exposes a known,
- // bounded term count (e.g. TermInSetQuery, getTermsCount() >= 0).
There, collecting is
- // cheap and lets us return a null supplier up-front so a parent
BooleanQuery can
- // short-circuit.
- //
- // For queries with an unknown term count (e.g. automaton queries:
wildcard / regexp /
- // prefix / range), collecting eagerly can scan the whole term
dictionary during
- // ScorerSupplier construction -- a leading wildcard such as "*foo*"
cannot seek and must
- // visit every term. That is supposed to be the cheap "planning" phase,
and doing it there
- // defeats a parent conjunction's ability to short-circuit (a sibling
clause matching no
- // documents can no longer skip this clause before the scan runs). So
for an unknown term
- // count we estimate the cost and defer term collection to
ScorerSupplier#get().
- if (q.getTermsCount() >= 0) {
- List<TermAndState> collectedTerms = new ArrayList<>();
- boolean collectResult = collectTerms(fieldDocCount, termsEnum,
collectedTerms);
- if (collectResult) {
- // Return a null supplier if no query terms were in the segment:
- if (collectedTerms.isEmpty()) {
- return null;
- }
+ // Try to eagerly collect matching terms. For queries with a known term
count
+ // (e.g. TermInSetQuery), we always collect eagerly. For queries with an
unknown term
+ // count (e.g. automaton queries: wildcard / regexp / prefix / range),
we attempt a
+ // budgeted probe: if the automaton can find all matching terms within a
small number of
+ // underlying TermsEnum operations, we use those results. Otherwise
(probe exhausts its
+ // budget, or no probe is possible), we estimate the cost and defer term
collection to
+ // ScorerSupplier#get() -- eagerly scanning the whole term dictionary
during the
+ // "planning" phase would defeat a parent conjunction's ability to
short-circuit.
+ List<TermAndState> eagerTerms = new ArrayList<>();
+ TermsEnum deferredTermsEnum = termsEnum;
+ boolean eagerSuccess;
- // TODO: Instead of replicating the cost logic of a BooleanQuery we
could consider
- // rewriting to a BQ eagerly at this point and delegating to its
cost method (instead of
- // lazily rewriting on #get). Not sure what the performance hit
would be of doing this
- // though.
- long sumTermCost = 0;
- for (TermAndState collectedTerm : collectedTerms) {
- sumTermCost += collectedTerm.docFreq;
- }
- cost = sumTermCost;
+ if (q.getTermsCount() >= 0) {
+ eagerSuccess = collectTerms(fieldDocCount, termsEnum, eagerTerms);
+ } else {
+ // Unknown term count. Try a cheap budgeted probe: if the automaton
can find
+ // all matching terms within a small number of underlying TermsEnum
operations,
+ // use those results eagerly. Otherwise, fall back to deferred
collection.
+ FilteredTermsEnum probeEnum = null;
+ if (termsEnum instanceof FilteredTermsEnum fte) {
+ probeEnum = fte;
+ } else if (q instanceof AutomatonQuery aq
+ && aq.getCompiled().type ==
CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
+ probeEnum = new AutomatonTermsEnum(terms.iterator(),
aq.getCompiled());
+ }
+ if (probeEnum != null) {
+ probeEnum.setVisitsBudget(AUTOMATON_TERM_COLLECT_VISIT_BUDGET);
+ boolean probeResult = collectTerms(fieldDocCount, probeEnum,
eagerTerms);
+ eagerSuccess = probeResult && !probeEnum.isVisitsBudgetExhausted();
} else {
- cost = estimateCost(terms, q.getTermsCount());
+ eagerSuccess = false;
}
- weightOrIteratorSupplier =
- leadCost -> {
- if (collectResult) {
- return rewriteAsBooleanQuery(context, collectedTerms);
- } else {
- // Too many terms to rewrite as a simple bq.
- // Invoke rewriteInner logic to handle rewriting:
- return rewriteInner(
- context, fieldDocCount, terms, termsEnum, collectedTerms,
leadCost);
- }
- };
+ if (!eagerSuccess) {
+ deferredTermsEnum = (probeEnum == termsEnum) ? q.getTermsEnum(terms)
: termsEnum;
+ eagerTerms = new ArrayList<>();
Review Comment:
when the probe runs out of budget, this throws away whatever terms it
already found (eagerTerms = new ArrayList<>())... those are valid matches :)
since collectTerms appends to the list, we could keep the partial results and
let the deferred pass continue from where the probe stopped, instead of redoing
that work?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]