IamJeffG opened a new issue, #37820:
URL: https://github.com/apache/arrow/issues/37820

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I am trying to compact the many small fragments of a Parquet dataset, writing the result to a new path on disk. The dataset is very large, but it is partitioned. My life will be so much simpler if I can do this out-of-core on a single VM with only about 8 GB of RAM. I believe this should be possible, since any single partition is small enough to fit fully into RAM and I am not changing how the dataset is partitioned: I should be able to just process one partition at a time.
   
   The Dataset API seems like it should do this elegantly, but when I run it, 
the Python process's RSS grows unbounded as it continues to write compacted 
partitions to the destination, until eventually it runs out of memory and gets 
killed.  Test runs on smaller datasets do finish correctly and without error.
   
   Minimal reproducible example: https://gist.github.com/IamJeffG/04be8bdedacb144a865d2fec3f5264c4
   Platform: Linux amd64
   Pyarrow versions tested: 12.0.1 and 13.0.0
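
   For reference, the failing compaction pass is essentially of this shape (a minimal sketch, not the exact gist code; the paths and the `date` partition field are placeholders):

   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   # Hypothetical paths and partition field; the real dataset is much larger.
   src_path = "/data/source_dataset"
   dst_path = "/data/compacted_dataset"
   part = ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive")

   src = ds.dataset(src_path, format="parquet", partitioning=part)

   # Re-write with the same partitioning; write_dataset streams batches from
   # the scanner, so in principle only about one partition's worth of data
   # should need to be in memory at a time.
   ds.write_dataset(
       src.scanner(),
       dst_path,
       format="parquet",
       partitioning=part,
   )
   ```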
   
   This might be a duplicate or near-duplicate of https://github.com/apache/arrow/issues/37630 -- I'm unable to say for sure:
   - Like that issue, the allocated memory reported by `jemalloc_memory_pool`, `mimalloc_memory_pool`, and `system_memory_pool` is very small or zero, and it stays constant even while the system shows my Python process's RAM usage growing unbounded (checked roughly as in the sketch after this list).
   - Unlike that issue, I don't see memory use continue to grow when running the scan multiple times.
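
   The memory-pool numbers above were read with something like this (a rough sketch; `psutil` is used only to read the process RSS):

   ```python
   import psutil
   import pyarrow as pa

   def report_memory(label: str = "") -> None:
       """Print Arrow pool allocations alongside the process RSS."""
       rss_mib = psutil.Process().memory_info().rss / 2**20
       print(
           f"{label} rss={rss_mib:.0f} MiB",
           f"pool_bytes={pa.default_memory_pool().bytes_allocated()}",
           f"total_allocated={pa.total_allocated_bytes()}",
       )

   # Called between partition writes: the Arrow numbers stay near zero
   # while RSS keeps climbing.
   report_memory("after partition write:")
   ```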
   
   Even when I manually iterate over the partitions of the original dataset 
(also shown in the gist) and write them one-by-one, the program grows its RAM 
usage in the same way.
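
   The manual variant is roughly of this shape (again a sketch rather than the exact gist code; it assumes a single hypothetical `date` partition field and re-uses the placeholder paths from the first sketch):

   ```python
   import gc

   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.dataset as ds

   src_path = "/data/source_dataset"   # hypothetical paths, as above
   dst_path = "/data/compacted_dataset"
   part = ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive")
   src = ds.dataset(src_path, format="parquet", partitioning=part)

   # Write one partition per iteration, hoping each iteration's memory is
   # released before the next begins; in practice RSS still grows.
   for value in pc.unique(src.to_table(columns=["date"])["date"]).to_pylist():
       table = src.to_table(filter=pc.field("date") == value)
       ds.write_dataset(
           table,
           dst_path,
           format="parquet",
           partitioning=part,
           existing_data_behavior="overwrite_or_ignore",
       )
       del table
       gc.collect()
   ```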
   
   Things suggested in other issues that I tried, but which do not help (applied roughly as in the sketch after this list):
   - Setting `max_open_files=10` doesn't affect memory usage (it just increases the number of fragments written to the destination, which is not ideal).
   - Calling `gc.collect()` between iterations makes no difference.
   - [`pa.jemalloc_set_decay_ms(0)`](https://stackoverflow.com/a/74045529) makes no difference, which is not surprising given the observations above.
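
   Concretely, those knobs were applied roughly like this (a sketch; none of them changes the RSS growth):

   ```python
   import gc

   import pyarrow as pa
   import pyarrow.dataset as ds

   pa.jemalloc_set_decay_ms(0)  # ask jemalloc to return freed pages to the OS eagerly

   src = ds.dataset("/data/source_dataset", format="parquet", partitioning="hive")

   ds.write_dataset(
       src.scanner(),
       "/data/compacted_dataset",
       format="parquet",
       partitioning=ds.partitioning(
           pa.schema([("date", pa.string())]), flavor="hive"  # hypothetical field
       ),
       max_open_files=10,  # caps concurrently open output files; yields more, smaller fragments
   )

   gc.collect()  # forcing collection (between iterations in the manual loop) doesn't help either
   ```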
   
   Is there a way to prevent memory from growing from one partition to the next? I imagine a workaround might exist in which I first iterate through the partition folders on disk and create a new Dataset from each, but then I lose all the nice things Arrow does by understanding the partitioning for me.
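
   The workaround I have in mind would look something like this (a hypothetical sketch; it hard-codes the hive directory layout instead of letting Arrow discover it):

   ```python
   from pathlib import Path

   import pyarrow.dataset as ds

   src_root = Path("/data/source_dataset")   # hypothetical paths
   dst_root = Path("/data/compacted_dataset")

   # Treat each "field=value" directory as its own small dataset and compact it
   # independently, so nothing is carried over from one partition to the next.
   for part_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
       small = ds.dataset(str(part_dir), format="parquet")
       ds.write_dataset(
           small,
           str(dst_root / part_dir.name),    # e.g. .../date=2023-01-01/
           format="parquet",
           existing_data_behavior="overwrite_or_ignore",
       )
   ```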
   
   
   
   ### Component(s)
   
   Parquet, Python

