IamJeffG opened a new issue, #37820: URL: https://github.com/apache/arrow/issues/37820
### Describe the bug, including details regarding any error messages, version, and platform.

I am trying to compact the many small fragments of a Parquet dataset, writing the result to a new path on disk. The dataset is very large, but it is partitioned. My life will be much simpler if I can do this out-of-core on a single VM with only about 8 GB of RAM. I believe this should be possible, since any single partition fits fully into RAM and I am not changing how the dataset is partitioned: it should be possible to process one partition at a time.

The Dataset API seems like it should handle this elegantly, but when I run it, the Python process's RSS grows without bound as it writes compacted partitions to the destination, until it eventually runs out of memory and is killed. Test runs on smaller datasets finish correctly and without error.

Minimum reproducible example: https://gist.github.com/IamJeffG/04be8bdedacb144a865d2fec3f5264c4 (a rough sketch of the same pattern is appended after the component list below).

Platform: Linux amd64
PyArrow versions tested: 12.0.1 and 13.0.0

This might be a duplicate or near-duplicate of https://github.com/apache/arrow/issues/37630 -- I'm unable to say for sure:

- Like that issue, the allocated memory reported by `jemalloc_memory_pool`, `mimalloc_memory_pool`, and `system_memory_pool` is very small or zero and stays constant, even while the system shows my Python process's RAM usage growing without bound.
- Unlike that issue, I don't see memory use continue to grow when running the scan multiple times.

Even when I manually iterate over the partitions of the original dataset (also shown in the gist) and write them one by one, the program's RAM usage grows in the same way.

Things suggested in other issues that I tried but that do not help:

- Setting `max_open_files=10` has no effect on memory usage (it only increases the number of fragments written to the destination, which is not ideal).
- Calling `gc.collect()` between iterations makes no difference.
- [`pa.jemalloc_set_decay_ms(0)`](https://stackoverflow.com/a/74045529) makes no difference, which is unsurprising given the observations above.

Is there a way to keep memory from growing from one partition to the next? I imagine a workaround might exist in which I first iterate through the folders on disk and create a new Dataset from each, but then I lose all the conveniences of Arrow understanding the partitioning for me.

### Component(s)

Parquet, Python
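For reference, here is a minimal sketch of the two patterns described above, as I understand them from the gist (the gist remains the authoritative reproducer). The paths and the `date` partition column are placeholders, not the real dataset's layout:

```python
import pyarrow as pa
import pyarrow.dataset as ds

SRC = "/data/source"      # placeholder paths; the real dataset is much larger
DST = "/data/compacted"

# Placeholder partition scheme -- the real partition column(s) differ.
part = ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive")

src = ds.dataset(SRC, format="parquet", partitioning=part)

# Approach 1: one-shot rewrite. Stream the source dataset through
# write_dataset so each partition directory ends up with fewer, larger files.
ds.write_dataset(
    src,
    DST,
    format="parquet",
    partitioning=part,
    existing_data_behavior="overwrite_or_ignore",
)

# Approach 2 (an alternative to the above): manually process one partition at
# a time. Each partition fits comfortably in RAM, yet RSS still grows from one
# iteration to the next.
seen = set()
for frag in src.get_fragments():
    key = str(frag.partition_expression)
    if key in seen:          # many small fragments share the same partition
        continue
    seen.add(key)
    table = src.to_table(filter=frag.partition_expression)
    ds.write_dataset(
        table,
        DST,
        format="parquet",
        partitioning=part,
        existing_data_behavior="overwrite_or_ignore",
    )
    del table
```

Both variants succeed on small inputs; on the full dataset both show the same steadily climbing RSS even though the Arrow memory pools report almost nothing allocated.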