RussellSpitzer opened a new pull request, #13935:
URL: https://github.com/apache/iceberg/pull/13935

   For a very long time, we have leaked direct memory when reading a file whose 
page encoding changes from dictionary to non-dictionary. Dictionary pages are 
decoded into an IntVector, while non-dictionary pages can use any of several 
different vector representations depending on the column type. When the vector 
type changed, we silently dropped the previous vector without clearing or 
releasing it.
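   
   To make the failure mode concrete, here is a minimal Java sketch of the 
release-before-swap pattern the fix needs. This is illustrative only, not the 
actual reader code: `reallocate`, the field names, and the choice of 
VarCharVector are all hypothetical.
   
   ```java
   import org.apache.arrow.memory.BufferAllocator;
   import org.apache.arrow.vector.FieldVector;
   import org.apache.arrow.vector.IntVector;
   import org.apache.arrow.vector.VarCharVector;

   class VectorSwapSketch {
     private FieldVector vec;

     // Swap to a vector of the type the new page encoding requires,
     // releasing the old vector first so its direct memory is returned.
     void reallocate(BufferAllocator allocator, boolean dictionaryEncoded) {
       if (vec != null) {
         vec.close(); // without this, the dropped vector's buffers leak
       }
       vec = dictionaryEncoded
           ? new IntVector("dict-ids", allocator)    // dictionary ids decode as ints
           : new VarCharVector("values", allocator); // plain pages use the column's type
       vec.allocateNew();
     }
   }
   ```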
   
   I found this while working on #13880, where we were hitting direct memory 
OOMs in some Spark tests. After a lot of searching and debugging, I narrowed 
the leak down to read tasks that read multiple pages with different encodings. 
Funnily enough, we already have a test that exercises files with mixed page 
encodings, and that test *would* have failed if it actually checked for memory 
leaks. 
   
   In this PR I fully instrument our Parquet vectorized read tests to check 
for memory leaks and fail if any are detected. Without the accompanying patch, 
the dictionaryMixedPages test fails.
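   
   Conceptually, the leak check asserts that a reader's child allocator has no 
outstanding memory once reading is done. A self-contained sketch of that idea 
(the class and allocator names here are mine, not the test code):
   
   ```java
   import org.apache.arrow.memory.BufferAllocator;
   import org.apache.arrow.memory.RootAllocator;
   import org.apache.arrow.vector.IntVector;

   public class LeakCheckSketch {
     public static void main(String[] args) {
       try (BufferAllocator root = new RootAllocator();
            BufferAllocator child = root.newChildAllocator("read-task", 0, Long.MAX_VALUE)) {
         IntVector ids = new IntVector("ids", child);
         ids.allocateNew(1024);
         ids.close(); // comment this out and the check below fails

         // The kind of assertion the instrumented tests now make: every byte
         // taken from the allocator must have been returned by the end of the read.
         if (child.getAllocatedMemory() != 0) {
           throw new AssertionError(child.getAllocatedMemory() + " bytes leaked");
         }
       }
     }
   }
   ```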
   
   Another thing we may want to consider in the future is whether or not we 
want to close our allocators. Currently the code base has a heap memory leak 
in ArrowAllocation.rootAllocator(): every VectorizedReaderBuilder creates a 
new child allocator, which is used to allocate vectors for that particular 
reader, but we never close these allocators even when we close all of the 
vectors allocated from them. If we did close them, we would have seen these 
memory issues much earlier, since every application would end with a string of 
"MemoryLeak Detected" messages. I'll file a followup issue for this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

