drempapis commented on PR #16031:
URL: https://github.com/apache/lucene/pull/16031#issuecomment-4380467888
I don't think `FuzzyQuery` belongs in the same group as `RegexpQuery`,
`WildcardQuery`, `PrefixQuery`, and `TermRangeQuery`. Those queries take user
patterns, and the compiled automaton can grow without a clear bound. That's why
they keep the automaton on the query and report it through
`Accountable#ramBytesUsed()`.
I agree, `FuzzyQuery` is different. Its `DFA` only depends on `termLength`
and `maxEdits`. maxEdits is capped at **2**, and `prefixLength` is clipped to
the term length. So the size and the build time are bounded and `predictable`.
I ran a small JMH benchmark to investigate this
```
import java.util.concurrent.TimeUnit;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.util.AttributeSource;
import org.openjdk.jmh.annotations.AuxCounters;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(value = 1, warmups = 1)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
public class FuzzyAutomatonRamBenchmark {
@Param({"16", "32", "64", "128", "256"})
public int termLength;
@Param({"1", "2"})
public int maxEdits;
private FuzzyQuery query;
@AuxCounters(AuxCounters.Type.EVENTS)
@State(Scope.Thread)
public static class AutomataRam {
public long ramBytes;
}
@Setup(Level.Trial)
public void setup() {
StringBuilder sb = new StringBuilder(termLength);
for (int i = 0; i < termLength; i++) {
sb.append((char) ('a' + (i % 26)));
}
query =
new FuzzyQuery(
new Term("f", sb.toString()),
maxEdits,
0,
FuzzyQuery.defaultMaxExpansions,
true);
}
@Benchmark
public long buildAutomata(AutomataRam metrics) {
long bytes = query.computeAutomataRamBytes(new AttributeSource());
metrics.ramBytes = bytes;
return bytes;
}
}
```
```
maxEdits termLength build µs/op retained bytes/build
1 16 71.5 ~49 KB
1 32 180.5 ~115 KB
1 64 351.2 ~225 KB
1 128 705.8 ~445 KB
1 256 1,424.4 ~885 KB
2 16 882.8 ~309 KB
2 32 2,589.5 ~770 KB
2 64 5,819.1 ~1.52 MB
2 128 10,838.3 ~3.04 MB
2 256 25,701.7 ~6.10 MB
```
Two things are clear:
- For a fixed `maxEdits`, the memory grows roughly linearly with
t`ermLength`.
- Going from `maxEdits=1` to `maxEdits=2` adds a constant factor of about
**6–7×**. It is not the kind of blow-up we see with `regexp` or `wildcard`.
So the `DFA` is small and predictable.
> exposing too many guts may hinder future improvements (e.g. removing
tableization)
I agree that removing tableization sounds like a good change, and it would
lower these numbers. But that is a separate improvement. It changes the
constant, not the fact that the cost depends on the input.
What mostly matters to me is **concurrency**. The numbers above are the cost
of `one` query. The same automata are shared across segments inside that query,
so segment count does not multiply the cost, but the per-query cost is still
held for the full lifetime of the search. On a busy server you have many fuzzy
queries running at the same time, and long terms (names, emails, IDs close to
length 256) do happen in real workloads. A few **MB** per query, times many
concurrent searches, adds up. Today this cost is invisible to any per-request
accounting. A future change that removes tableization will make the constant
smaller, but the variability per request is still there as long as `termLength`
and `maxEdits` come from the user. I think giving callers a way to measure this
cost per request is exactly what accounting hooks are for.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]