drempapis commented on PR #16031:
URL: https://github.com/apache/lucene/pull/16031#issuecomment-4380467888

   
   I don't think `FuzzyQuery` belongs in the same group as `RegexpQuery`, 
`WildcardQuery`, `PrefixQuery`, and `TermRangeQuery`. Those queries take user 
patterns, and the compiled automaton can grow without a clear bound. That's why 
they keep the automaton on the query and report it through 
`Accountable#ramBytesUsed()`.
   
   I agree, `FuzzyQuery` is different. Its `DFA` only depends on `termLength` 
and `maxEdits`. maxEdits is capped at **2**, and `prefixLength` is clipped to 
the term length. So the size and the build time are bounded and `predictable`.
   
   I ran a small JMH benchmark to investigate this
   ```
   import java.util.concurrent.TimeUnit;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.FuzzyQuery;
   import org.apache.lucene.util.AttributeSource;
   import org.openjdk.jmh.annotations.AuxCounters;
   import org.openjdk.jmh.annotations.Benchmark;
   import org.openjdk.jmh.annotations.BenchmarkMode;
   import org.openjdk.jmh.annotations.Fork;
   import org.openjdk.jmh.annotations.Level;
   import org.openjdk.jmh.annotations.Measurement;
   import org.openjdk.jmh.annotations.Mode;
   import org.openjdk.jmh.annotations.OutputTimeUnit;
   import org.openjdk.jmh.annotations.Param;
   import org.openjdk.jmh.annotations.Scope;
   import org.openjdk.jmh.annotations.Setup;
   import org.openjdk.jmh.annotations.State;
   import org.openjdk.jmh.annotations.Warmup;
   
   
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
   @State(Scope.Benchmark)
   @Fork(value = 1, warmups = 1)
   @Warmup(iterations = 3, time = 1)
   @Measurement(iterations = 5, time = 1)
   public class FuzzyAutomatonRamBenchmark {
   
     @Param({"16", "32", "64", "128", "256"})
     public int termLength;
   
     @Param({"1", "2"})
     public int maxEdits;
   
     private FuzzyQuery query;
   
     @AuxCounters(AuxCounters.Type.EVENTS)
     @State(Scope.Thread)
     public static class AutomataRam {
       public long ramBytes;
     }
   
     @Setup(Level.Trial)
     public void setup() {
       StringBuilder sb = new StringBuilder(termLength);
       for (int i = 0; i < termLength; i++) {
         sb.append((char) ('a' + (i % 26)));
       }
       query =
           new FuzzyQuery(
               new Term("f", sb.toString()),
               maxEdits,
               0,
               FuzzyQuery.defaultMaxExpansions,
               true);
     }
   
     @Benchmark
     public long buildAutomata(AutomataRam metrics) {
       long bytes = query.computeAutomataRamBytes(new AttributeSource());
       metrics.ramBytes = bytes;
       return bytes;
     }
   }
   ```
   ```
   maxEdits  termLength  build µs/op   retained bytes/build
       1         16          71.5         ~49 KB
       1         32         180.5        ~115 KB
       1         64         351.2        ~225 KB
       1        128         705.8        ~445 KB
       1        256       1,424.4        ~885 KB
       2         16         882.8        ~309 KB
       2         32       2,589.5        ~770 KB
       2         64       5,819.1       ~1.52 MB
       2        128      10,838.3       ~3.04 MB
       2        256      25,701.7       ~6.10 MB
   ```
   
   Two things are clear:
   
   -  For a fixed `maxEdits`, the memory grows roughly linearly with 
t`ermLength`.
   -  Going from `maxEdits=1` to `maxEdits=2` adds a constant factor of about 
**6–7×**. It is not the kind of blow-up we see with `regexp` or `wildcard`.
   
   So the `DFA` is small and predictable.
   
   > exposing too many guts may hinder future improvements (e.g. removing 
tableization)
   
   I agree that removing tableization sounds like a good change, and it would 
lower these numbers. But that is a separate improvement. It changes the 
constant, not the fact that the cost depends on the input.
   
   What mostly matters to me is **concurrency**. The numbers above are the cost 
of `one` query. The same automata are shared across segments inside that query, 
so segment count does not multiply the cost, but the per-query cost is still 
held for the full lifetime of the search. On a busy server you have many fuzzy 
queries running at the same time, and long terms (names, emails, IDs close to 
length 256) do happen in real workloads. A few **MB** per query, times many 
concurrent searches, adds up. Today this cost is invisible to any per-request 
accounting. A future change that removes tableization will make the constant 
smaller, but the variability per request is still there as long as `termLength` 
and `maxEdits` come from the user. I think giving callers a way to measure this 
cost per request is exactly what accounting hooks are for.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to