salvatorecampagna opened a new issue, #16113:
URL: https://github.com/apache/lucene/issues/16113

   ### Description
   
   ## Problem
   
   `Accountable.ramBytesUsed()` is specified as "an estimate of the bytes used 
by this object" without saying whether bytes from referenced objects are 
included. In practice, implementations in the Lucene codebase vary: some count 
everything reachable from the object, including borrowed/shared state passed in 
via the constructor; some count only what they allocated; a few have explicit 
comments noting that a referenced object "should be accounted separately."
   
   When external code sums `ramBytesUsed()` across multiple `Accountable` 
instances that share state, the same memory can be counted more than once.
   
   `getChildResources()` is conventionally a diagnostic accessor for printing 
and inspection. It does not advertise itself as a deduplicated ownership tree, 
and consumers should not assume that summing over it yields a well-defined 
per-node total.
   
   ## Evidence in the codebase
   
   **`GlobalOrdinalsQuery`** counts a shared `OrdinalMap` received through its 
constructor:
   
   ```
   // 
lucene/join/src/java/org/apache/lucene/search/join/GlobalOrdinalsQuery.java
   this.ramBytesUsed = BASE_RAM_BYTES + ... + 
RamUsageEstimator.sizeOfObject(this.globalOrds) + ...
   ```
   
   The `OrdinalMap` is cached on the top reader context and shared across 
queries that target the same join field on the same index. Two queries that 
include it each report its full size; the same bytes are credited to multiple 
queries.
   
   **`GlobalOrdinalsWithScoreQuery`** in 
`lucene/join/src/java/org/apache/lucene/search/join/GlobalOrdinalsWithScoreQuery.java`
 has the same pattern at the equivalent site.
   
   **`Lucene99HnswVectorsWriter.FieldWriter`** includes 
`flatFieldVectorsWriter.ramBytesUsed()` in its own count. The outer writer's 
per-field loop comments that the field tracks the delegate's usage and the 
outer omits the delegate from its own count to compensate. The pair has 
cross-class coupling: the inner counts something it does not allocate, and the 
outer relies on that to avoid counting it itself. The current accounting 
therefore depends on the inner and outer staying in step.
   
   ## Downstream impact
   
   Embedders (Elasticsearch, Solr, OpenSearch, custom indexers) build memory 
inventories by summing `ramBytesUsed()` across the Accountables they hold open. 
When two Accountables that share state both report the shared size, the totals 
inflate, and downstream consumers cannot tell from the API which value reflects 
actual heap. Concrete consequences:
   
   - Capacity planning and circuit-breaker thresholds drift; breakers trip at 
the wrong actual heap.
   - Diagnostics become forensic: "why does this reader report N GB on an M GB 
heap?" is not answerable from the API alone.
   - Newer subsystems that build categorical inventories (per-shard, 
per-segment, per-tenant) cannot produce useful totals without a deduplication 
layer that papers over the ambiguity.
   
   ## Proposal
   
   This issue proposes three concrete actions:
   
   1. **Clarify `Accountable`'s javadoc** to recommend an ownership-oriented 
convention for implementations, especially new and updated ones. Define "owned" 
as memory allocated for this object's lifetime, or released by its `close()` if 
applicable. Inputs received via constructors or factories may be borrowed, 
wrapped, sliced, or copied; implementations should report only the bytes they 
actually own, not the deep content of referenced storage they do not own. The 
reference slot for a borrowed input is part of the holder's own object layout; 
re-adding the referent's deep content can lead to double counting when 
consumers sum across multiple Accountables. State that `getChildResources()` is 
a diagnostic accessor and not a deduplicated ownership tree, and that summing 
over it is not sum-safe in general.
   
   2. **Document the convention** alongside the javadoc clarification. Report 
bytes you allocate, including derived allocations and your own wrapper 
overhead; do not report deep content of storage you only reference; if you 
`close()` something, you report its bytes. Recommend the convention for new and 
updated implementations.
   
   3. **Fix clear offenders incrementally** with targeted PRs. Anchor cases for 
this issue: the two `GlobalOrdinals*` queries and the HNSW writer pair above. 
Each PR addresses one or a few classes, includes the size of the doubly-counted 
object and the resulting change to the reported value, and documents the 
semantic shift in the PR description.
   
   ## Backward compatibility
   
   The proposal does not change the `Accountable` interface contract or 
signature. The javadoc clarification documents a convention already used in 
parts of the codebase.
   
   Targeted fixes to clear offenders will change the numbers those classes 
report. Downstream consumers that have tuned thresholds against the inflated 
values will see different numbers after each fix lands. Each such change is 
bounded by three constraints:
   
   - the existing behavior is clearly misleading (counts widely-shared state, 
like the `OrdinalMap` cases above),
   - the fix is narrow (one or a few classes per PR),
   - the PR description documents the semantic shift so downstream consumers 
can adjust expectations.
   
   `Accountable.getChildResources()` is unchanged. Existing patterns where a 
parent class already aggregates child `ramBytesUsed()` while also exposing 
those children via `getChildResources()` do not need to change. Examples in the 
codebase:
   
   - `DocumentsWriterPerThread` aggregates child `ramBytesUsed()` into its 
total while also returning children via `getChildResources()`.
   - `IndexingChain` does the same.
   - `TestLRUQueryCache` and similar tests assume aggregate semantics on 
existing implementations.
   
   These continue to work as today.
   
   ## Open questions
   
   - What is the right wording for the javadoc clarification? "Owned" needs to 
be defined unambiguously and in terms that fit Java/Lucene conventions.
   - Should the convention live in `Accountable`'s class javadoc, in a 
`package-info.java`, or in a dedicated developer-docs page? The first is most 
visible to implementers.
   - Should developer docs encourage authors of new Accountable implementations 
to call out which fields are owned vs borrowed in code comments at 
construction? Useful for review, with some friction.
   
   ## Related issues
   
   - [apache/lucene#15097](https://github.com/apache/lucene/issues/15097) 
(closed): "Incorrect size estimation of BooleanQuery in LRUQueryCache causing 
heap exhaustion". Operational evidence that query-size estimation affects cache 
budgets: `BooleanQuery` falling back to a default size in `LRUQueryCache` 
contributed to heap exhaustion in production.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to