bert-beyondloops opened a new issue, #21928:
URL: https://github.com/apache/datafusion/issues/21928

   ### Describe the bug
   
   ScalarValue::compacted() is documented as producing a scalar that minimises 
its memory footprint by discarding unreferenced array data. For most array 
types this works correctly, but for Utf8View and BinaryView (and any container 
type — Struct, List, LargeList, … — whose leaf values have a view type), the 
method silently fails to release the original buffer allocation. The scalar 
continues to hold a live Arc reference into the source batch, keeping the 
entire batch allocation alive for as long as the scalar exists.
   
   ScalarValue::compacted() eventually calls copy_array_data, which for 
view-based arrays Arc-clones the existing data buffers rather than copying the 
bytes that the scalar actually references. View arrays can carry multiple 
large, discontiguous data buffers. Each view is a 128-bit descriptor that 
either inlines a short value (12 bytes or fewer) or points at a slice of a data 
buffer, so a single short value may reference only a tiny slice deep inside 
a 64 MiB buffer. After compacted() the Arc count of those buffers is 
incremented by one, but the allocations themselves are unchanged.
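
The retention described above can be illustrated with a toy model in plain Rust. This is only a sketch: `ViewScalar`, `shallow_copy` and `shallow_copy_demo` are hypothetical names invented for the illustration, the 1 MiB buffer stands in for a large batch buffer, and none of this is the arrow-rs or DataFusion code. It shows that Arc-cloning a buffer keeps the whole allocation alive even after the source is dropped.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a scalar backed by a view array: a shared data
/// buffer plus the (offset, len) of the bytes the value actually uses.
struct ViewScalar {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl ViewScalar {
    /// The bytes this scalar logically contains.
    fn value(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }

    /// Size of the allocation this scalar keeps alive.
    fn retained_bytes(&self) -> usize {
        self.buffer.len()
    }

    /// Models the reported behaviour: the buffer is Arc-cloned, not copied.
    fn shallow_copy(&self) -> ViewScalar {
        ViewScalar {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset,
            len: self.len,
        }
    }
}

/// Returns (value length, retained bytes, strong count) for the copy once
/// the source batch and the original scalar have been dropped.
fn shallow_copy_demo() -> (usize, usize, usize) {
    // 1 MiB stands in for a large batch buffer.
    let batch: Arc<Vec<u8>> = Arc::new(vec![b'x'; 1 << 20]);
    let scalar = ViewScalar { buffer: Arc::clone(&batch), offset: 4096, len: 16 };
    let copy = scalar.shallow_copy();
    drop(scalar);
    drop(batch);
    (copy.value().len(), copy.retained_bytes(), Arc::strong_count(&copy.buffer))
}
```

In this model the copy is the only remaining owner, yet it still pins the full 1 MiB buffer for a 16-byte value.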
                                                                      
   The correct primitive is StringViewArray::gc() / BinaryViewArray::gc(), 
which copies only the live bytes into a fresh, right-sized allocation and drops 
the originals. DataFusion's ScalarValue::compacted() never calls this method.  
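
As a rough illustration of what a gc-style compaction achieves, the sketch below copies only the referenced bytes into a fresh buffer and lets the source allocation be freed. It is a plain-Rust toy model, not the arrow-rs implementation of gc(): `ViewValue`, `compact` and `compact_demo` are hypothetical names, and the 1 MiB buffer is an assumed stand-in for a large batch buffer.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a value that points into a shared data buffer.
struct ViewValue {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl ViewValue {
    /// The bytes this value logically contains.
    fn value(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

/// Copies only the live bytes into a new, right-sized buffer, so the result
/// no longer holds a reference to the source allocation.
fn compact(view: &ViewValue) -> ViewValue {
    let live: Vec<u8> = view.value().to_vec();
    let len = live.len();
    ViewValue { buffer: Arc::new(live), offset: 0, len }
}

/// Returns (compacted value, bytes retained by the compacted value,
/// whether the source buffer was released).
fn compact_demo() -> (Vec<u8>, usize, bool) {
    let mut data = vec![0u8; 1 << 20];
    data[4096..4101].copy_from_slice(b"hello");
    let batch = Arc::new(data);
    // A weak handle lets us observe when the last strong reference is gone.
    let source = Arc::downgrade(&batch);

    let view = ViewValue { buffer: Arc::clone(&batch), offset: 4096, len: 5 };
    let compacted = compact(&view);
    drop(view);
    drop(batch);

    let source_freed = source.upgrade().is_none();
    (compacted.value().to_vec(), compacted.buffer.len(), source_freed)
}
```

Here the compacted value retains 5 bytes instead of 1 MiB, and the source buffer is released as soon as the batch and the original view are dropped.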
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   After scalar.compacted(), the scalar's total heap allocation should be 
proportional to the data it actually contains — not to the source batch it was 
originally extracted from.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
