bert-beyondloops opened a new issue, #21928:
URL: https://github.com/apache/datafusion/issues/21928
### Describe the bug
ScalarValue::compacted() is documented as producing a scalar that minimises
its memory footprint by discarding unreferenced array data. For most array
types this works correctly, but for Utf8View and BinaryView (and any container
type — Struct, List, LargeList, … — whose leaf values have a view type), the
method silently fails to release the original buffer allocation. The scalar
continues to hold a live Arc reference into the source batch, keeping the
entire batch allocation alive for as long as the scalar exists.

ScalarValue::compacted() eventually calls copy_array_data, which for
view-based arrays Arc-clones the existing data buffers rather than copying the
bytes that the scalar actually references. View arrays can carry multiple
large, discontiguous data buffers. Each view is a 128-bit descriptor that
either stores a value of 12 bytes or fewer inline or points at a slice of one
of those buffers, so a scalar holding a 20-byte string may reference a tiny
slice deep inside a 64 MiB buffer. After compacted() the Arc count of those
buffers is incremented by one, but the allocations themselves are unchanged.
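
To make the 128-bit descriptor concrete, here is a minimal std-only sketch of how a view can be decoded. It assumes the little-endian field order of the Arrow view layout (length, then either inline bytes or prefix, buffer index, and offset). The names `View`, `decode_view`, and `encode_buffer_view` are hypothetical and are not arrow-rs APIs.

```rust
/// Decoded form of one 16-byte view (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum View {
    /// Values of at most 12 bytes live inside the view itself.
    Inline { len: u32 },
    /// Longer values point at `len` bytes starting at `offset`
    /// inside data buffer number `buffer_index`.
    Buffer { len: u32, buffer_index: u32, offset: u32 },
}

/// Splits a raw 128-bit view into its fields.
fn decode_view(raw: u128) -> View {
    let len = raw as u32; // bits 0..32 hold the length
    if len <= 12 {
        View::Inline { len }
    } else {
        View::Buffer {
            len,
            buffer_index: (raw >> 64) as u32, // bits 64..96
            offset: (raw >> 96) as u32,       // bits 96..128
        }
    }
}

/// Builds a buffer-referencing view; `prefix` carries the first four bytes of the value.
fn encode_buffer_view(len: u32, prefix: u32, buffer_index: u32, offset: u32) -> u128 {
    (len as u128)
        | ((prefix as u128) << 32)
        | ((buffer_index as u128) << 64)
        | ((offset as u128) << 96)
}

fn main() {
    // The one-byte value "a" is stored inline, starting at byte 4 of the view.
    let inline = 1u128 | ((b'a' as u128) << 32);
    println!("{:?}", decode_view(inline));

    // A 20-byte value located 60 MiB into data buffer 0.
    let far = encode_buffer_view(20, 0, 0, 60 * 1024 * 1024);
    println!("{:?}", decode_view(far));
}
```

Only the second kind of view points into a data buffer, but a view array holds an Arc to every data buffer it was built with regardless of which views reference them.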

The correct primitive is StringViewArray::gc() / BinaryViewArray::gc(),
which returns a new array whose data buffers contain only the live bytes in a
fresh, right-sized allocation, so the original buffers can be freed once the
old array is dropped. DataFusion's ScalarValue::compacted() never calls this
method.
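
The difference between Arc-cloning a buffer and copying only the live bytes can be illustrated with a small std-only model. This is a sketch under stated assumptions, not DataFusion or arrow-rs code: `ViewScalar`, `shallow_copy`, `compact`, and `retained_bytes` are hypothetical names, and a single `Arc<Vec<u8>>` stands in for the data buffers of a view array.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a one-element view array:
/// a slice of a shared, reference-counted buffer.
struct ViewScalar {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl ViewScalar {
    /// The bytes the scalar actually references.
    fn value(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }

    /// Models the reported behaviour: the buffer is Arc-cloned, not copied,
    /// so the large allocation stays alive.
    fn shallow_copy(&self) -> ViewScalar {
        ViewScalar {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset,
            len: self.len,
        }
    }

    /// Models a gc-style compaction: only the live bytes are copied
    /// into a fresh, right-sized buffer.
    fn compact(&self) -> ViewScalar {
        ViewScalar {
            buffer: Arc::new(self.value().to_vec()),
            offset: 0,
            len: self.len,
        }
    }

    /// Size of the buffer this scalar keeps alive.
    fn retained_bytes(&self) -> usize {
        self.buffer.len()
    }
}

fn main() {
    // A 64 MiB buffer standing in for the source batch.
    let batch_buffer = Arc::new(vec![b'x'; 64 * 1024 * 1024]);
    let scalar = ViewScalar {
        buffer: Arc::clone(&batch_buffer),
        offset: 1024,
        len: 20,
    };

    let shallow = scalar.shallow_copy();
    let compacted = scalar.compact();

    println!("shallow copy retains {} bytes", shallow.retained_bytes());
    println!("compacted copy retains {} bytes", compacted.retained_bytes());
    println!("references to the batch buffer: {}", Arc::strong_count(&batch_buffer));
}
```

In this model the shallow copy still pins the full 64 MiB buffer, while the compacted copy holds only the 20 bytes it needs. That is the property the issue asks compacted() to provide for view types.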
### To Reproduce
_No response_
### Expected behavior
After scalar.compacted(), the scalar's total heap allocation should be
proportional to the data it actually contains — not to the source batch it was
originally extracted from.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]