reugn opened a new pull request, #16046:
URL: https://github.com/apache/lucene/pull/16046

   ### Description
   
   `CompiledAutomaton.ramBytesUsed()` counts the underlying Automaton twice on 
the DFA path, over-reporting retained heap by 18–35% on non-trivial wildcard 
and regexp queries.
   
   The `automaton` field is aliased to `runAutomaton.automaton` — a single 
Automaton instance referenced from two places:
   
   ```java
   // CompiledAutomaton.java:261-263
   runAutomaton = new ByteRunAutomaton(binary, true);
   this.automaton = runAutomaton.automaton;   // same reference
   ```
   
   `ramBytesUsed()` accounts for it twice: once directly via 
`sizeOfObject(automaton)`, and again through `sizeOfObject(runAutomaton)` which 
delegates to `RunAutomaton.ramBytesUsed()` and adds `sizeOfObject(automaton)` 
itself.
   
   The fix is to drop the redundant `sizeOfObject(automaton)` from 
`CompiledAutomaton.ramBytesUsed()`. In the NFA branch both fields are `null`, 
so this is a no-op there.
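   
   The double count can be reproduced in isolation. The following is a minimal 
sketch, not Lucene code: `Node`, `Runner`, and `Holder` are hypothetical 
stand-ins for `Automaton`, `RunAutomaton`, and `CompiledAutomaton`, and the 
byte sizes are made up for illustration.
   
   ```java
   /** Toy model of the aliasing double count. Not Lucene code: all names and sizes are hypothetical. */
   final class AliasDoubleCount {
   
     /** Stand-in for Automaton, with a made-up fixed size. */
     static final class Node {
       final long bytes;
       Node(long bytes) { this.bytes = bytes; }
     }
   
     /** Stand-in for RunAutomaton: its own size already includes the Node it wraps. */
     static final class Runner {
       final Node node;
       final long ownBytes;
       Runner(Node node, long ownBytes) { this.node = node; this.ownBytes = ownBytes; }
       long ramBytesUsed() { return ownBytes + node.bytes; }
     }
   
     /** Stand-in for CompiledAutomaton: the node field aliases runner.node. */
     static final class Holder {
       final Runner runner;
       final Node node;
       final long ownBytes;
       Holder(Runner runner, long ownBytes) {
         this.runner = runner;
         this.node = runner.node; // same reference, as on the DFA path
         this.ownBytes = ownBytes;
       }
       /** Old accounting: the aliased Node is added directly and again via the Runner. */
       long ramBytesUsedBefore() { return ownBytes + node.bytes + runner.ramBytesUsed(); }
       /** Fixed accounting: the Node is reached only through the Runner. */
       long ramBytesUsedAfter() { return ownBytes + runner.ramBytesUsed(); }
     }
   
     public static void main(String[] args) {
       Holder h = new Holder(new Runner(new Node(700), 200), 100);
       System.out.println(h.ramBytesUsedBefore()); // 1700: the 700-byte Node is counted twice
       System.out.println(h.ramBytesUsedAfter());  // 1000: each object is counted once
     }
   }
   ```
   
   The gap between the two results is exactly the size of the aliased object, 
which is why the over-report grows with the automaton.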
   
   ```java
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.WildcardQuery;
   import org.apache.lucene.util.automaton.*;
   
   Automaton dfa = Operations.determinize(
       WildcardQuery.toAutomaton(new Term("f", "*" + "x".repeat(3000) + "*")),
       Integer.MAX_VALUE);
   CompiledAutomaton ca = new CompiledAutomaton(dfa, false, true, false);
   
   System.out.println(ca.ramBytesUsed());
   // Before: 8_361_506   (over-reports by 2.15 MB, ratio ~1.35)
   // After:  6_214_953   (matches retained heap within 159 bytes)
   ```
   
   Cross-checked against JOL's 
`org.openjdk.jol.info.GraphLayout.parseInstance(ca).totalSize()`. In the table 
below, "reported" is `CompiledAutomaton.ramBytesUsed()` and "retained" is JOL's 
`GraphLayout.totalSize()`; all sizes are in bytes.
   
   | Pattern | reported (before) | reported (after) | retained |
   |---|---:|---:|---:|
   | `*foo*` | 13,634 | 10,593 | 10,752 |
   | `*foo*bar*baz*` | 53,586 | 42,921 | 43,080 |
   | `*a*b*c*d*e*f*g*` | 79,282 | 61,577 | 61,736 |
   | `*` + `x`×1000 + `*` | 2,789,994 | 2,073,945 | 2,074,104 |
   | `*` + `x`×3000 + `*` | 8,361,506 | 6,214,953 | 6,215,112 |
   | `*a*b*…*j*` (depth 10) | 154,522 | 121,457 | 121,616 |
   | `*a*b*…*t*` (depth 20) | 677,562 | 548,497 | 548,656 |
   | `.*foo.*bar.*` | 34,258 | 27,161 | 27,320 |
   
   After the fix, `ramBytesUsed()` comes in a constant 159 bytes below actual 
retained heap in every case above.
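   
   The constant gap can be recomputed from the table above. This is a small 
sketch for checking the arithmetic; the class and method names are hypothetical, 
and the numbers are copied from the table.
   
   ```java
   /** Recomputes the offsets from the table above. Class and method names are hypothetical. */
   final class OffsetCheck {
   
     /** Rows copied from the table: {reported before, reported after, retained}, in bytes. */
     static final long[][] ROWS = {
       {13_634, 10_593, 10_752},
       {53_586, 42_921, 43_080},
       {79_282, 61_577, 61_736},
       {2_789_994, 2_073_945, 2_074_104},
       {8_361_506, 6_214_953, 6_215_112},
       {154_522, 121_457, 121_616},
       {677_562, 548_497, 548_656},
       {34_258, 27_161, 27_320},
     };
   
     /** Returns retained minus reported-after if it is identical for every row, otherwise -1. */
     static long constantGap(long[][] rows) {
       long gap = rows[0][2] - rows[0][1];
       for (long[] row : rows) {
         if (row[2] - row[1] != gap) {
           return -1;
         }
       }
       return gap;
     }
   
     public static void main(String[] args) {
       System.out.println(constantGap(ROWS)); // 159
     }
   }
   ```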
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

