reugn opened a new pull request, #16046:
URL: https://github.com/apache/lucene/pull/16046
### Description
`CompiledAutomaton.ramBytesUsed()` counts the underlying Automaton twice on
the DFA path, over-reporting retained heap by 18–35% on non-trivial wildcard
and regexp queries.
The `automaton` field aliases `runAutomaton.automaton`, so a single
`Automaton` instance is referenced from two places:
```java
// CompiledAutomaton.java:261-263
runAutomaton = new ByteRunAutomaton(binary, true);
this.automaton = runAutomaton.automaton; // same reference
```
`ramBytesUsed()` therefore accounts for it twice: once directly via
`sizeOfObject(automaton)`, and again via `sizeOfObject(runAutomaton)`, which
delegates to `RunAutomaton.ramBytesUsed()`, and that method adds
`sizeOfObject(automaton)` itself.
The fix is to drop the redundant `sizeOfObject(automaton)` from
`CompiledAutomaton.ramBytesUsed()`. In the NFA branch both fields are `null`,
so this is a no-op there.
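The double count can be illustrated with a small self-contained model. This is only a sketch: the `Toy*` classes, their shallow sizes, and the byte counts are hypothetical stand-ins, not the actual Lucene code, and they model the DFA path only.

```java
// Toy model of the aliasing described above. All names and sizes here are
// hypothetical; they are not the real Lucene classes or real object sizes.
final class DoubleCountSketch {

    static final class ToyAutomaton {
        final long bytes;
        ToyAutomaton(long bytes) { this.bytes = bytes; }
        long ramBytesUsed() { return bytes; }
    }

    static final class ToyRunAutomaton {
        static final long SHALLOW = 32; // assumed shallow size
        final ToyAutomaton automaton;
        final long tableBytes;
        ToyRunAutomaton(ToyAutomaton automaton, long tableBytes) {
            this.automaton = automaton;
            this.tableBytes = tableBytes;
        }
        // Already includes the automaton it owns.
        long ramBytesUsed() { return SHALLOW + tableBytes + automaton.ramBytesUsed(); }
    }

    static final class ToyCompiledAutomaton {
        static final long SHALLOW = 48; // assumed shallow size
        final ToyRunAutomaton runAutomaton;
        final ToyAutomaton automaton; // same reference as runAutomaton.automaton

        ToyCompiledAutomaton(ToyAutomaton a, long tableBytes) {
            this.runAutomaton = new ToyRunAutomaton(a, tableBytes);
            this.automaton = runAutomaton.automaton;
        }

        // Before: the aliased automaton is added directly and again via runAutomaton.
        long ramBytesUsedBefore() {
            return SHALLOW + automaton.ramBytesUsed() + runAutomaton.ramBytesUsed();
        }

        // After: the automaton is counted once, through runAutomaton only.
        long ramBytesUsedAfter() {
            return SHALLOW + runAutomaton.ramBytesUsed();
        }
    }

    public static void main(String[] args) {
        ToyCompiledAutomaton ca = new ToyCompiledAutomaton(new ToyAutomaton(1_000), 2_000);
        System.out.println("before: " + ca.ramBytesUsedBefore()); // 48 + 1000 + (32 + 2000 + 1000) = 4080
        System.out.println("after:  " + ca.ramBytesUsedAfter());  // 48 + (32 + 2000 + 1000) = 3080
    }
}
```

In this model the two results differ by exactly the size of the aliased automaton, which is the over-count the change removes.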
```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.automaton.*;

// Reproduction: build a large DFA from a wildcard pattern and print the reported size.
Automaton dfa = Operations.determinize(
    WildcardQuery.toAutomaton(new Term("f", "*" + "x".repeat(3000) + "*")),
    Integer.MAX_VALUE);
CompiledAutomaton ca = new CompiledAutomaton(dfa, false, true, false);
System.out.println(ca.ramBytesUsed());
// Before: 8_361_506 (over-reports by 2.15 MB, ratio ~1.35)
// After:  6_214_953 (matches retained heap within 159 bytes)
```
Cross-checked against
`org.openjdk.jol.info.GraphLayout.parseInstance(ca).totalSize()`. In the table
below, "reported" is `CompiledAutomaton.ramBytesUsed()` and "retained" is JOL's
`GraphLayout.totalSize()`; all values are in bytes.
| Pattern | reported before (bytes) | reported after (bytes) | retained (bytes) |
|---|---:|---:|---:|
| `*foo*` | 13,634 | 10,593 | 10,752 |
| `*foo*bar*baz*` | 53,586 | 42,921 | 43,080 |
| `*a*b*c*d*e*f*g*` | 79,282 | 61,577 | 61,736 |
| `*` + `x`×1000 + `*` | 2,789,994 | 2,073,945 | 2,074,104 |
| `*` + `x`×3000 + `*` | 8,361,506 | 6,214,953 | 6,215,112 |
| `*a*b*…*j*` (depth 10) | 154,522 | 121,457 | 121,616 |
| `*a*b*…*t*` (depth 20) | 677,562 | 548,497 | 548,656 |
| `.*foo.*bar.*` | 34,258 | 27,161 | 27,320 |
After the fix, `ramBytesUsed()` is a constant 159 bytes below the actual
retained heap in every case.
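The table's arithmetic can be re-checked with a short standalone program. This is a sketch that only re-derives figures from the numbers listed above; the class and method names are made up for illustration, and it does not call Lucene or JOL.

```java
// Recomputes, from the table values, the gap between retained heap and the
// reported size after the fix, and the over-report ratio before the fix.
final class OffsetCheck {

    // {reported before, reported after, retained}, in bytes, copied from the table.
    static final long[][] ROWS = {
        {13_634, 10_593, 10_752},
        {53_586, 42_921, 43_080},
        {79_282, 61_577, 61_736},
        {2_789_994, 2_073_945, 2_074_104},
        {8_361_506, 6_214_953, 6_215_112},
        {154_522, 121_457, 121_616},
        {677_562, 548_497, 548_656},
        {34_258, 27_161, 27_320},
    };

    // Bytes by which the fixed ramBytesUsed() falls short of retained heap.
    static long offsetAfter(long[] row) {
        return row[2] - row[1];
    }

    // Ratio of the pre-fix reported size to retained heap.
    static double ratioBefore(long[] row) {
        return (double) row[0] / row[2];
    }

    public static void main(String[] args) {
        for (long[] row : ROWS) {
            System.out.printf("retained - after = %d bytes, before / retained = %.3f%n",
                offsetAfter(row), ratioBefore(row));
        }
    }
}
```

Every row yields the same 159-byte difference, while the pre-fix ratio varies with the pattern.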