xWaita opened a new issue, #45038: URL: https://github.com/apache/arrow/issues/45038
### Describe the bug, including details regarding any error messages, version, and platform.

According to the docs of `pyarrow.dataset.write_dataset`, when `max_open_files` is reached, the least recently used file will be closed (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html). However, in the C++ dataset writer, what actually happens is that the largest file is closed (https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/dataset_writer.cc#L656).

For long-running dataset writes with many partitions, this means that over time the set of open files trends towards smaller and smaller files: recently opened files are closed prematurely, while very old open files hang around and stay in memory.

It would probably make sense to bring the C++ behaviour in line with the docs, i.e. close the oldest file rather than the largest one.

### Component(s)

C++, Python
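For illustration, here is a minimal, self-contained sketch of the two eviction strategies being contrasted. The `OpenFile` struct and the function names are hypothetical and are not Arrow's internal API; `CloseLargest` mirrors what the linked C++ code effectively does today, while `CloseLeastRecentlyUsed` matches what the pyarrow docs describe.

```cpp
// Hypothetical illustration (not actual Arrow internals) of the two
// eviction policies: close the largest open file vs. close the least
// recently used one.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct OpenFile {
  std::string path;
  int64_t bytes_written = 0;
  uint64_t last_used = 0;  // monotonically increasing "tick" of the last write
};

// What the C++ dataset writer effectively does today: close the largest file,
// even if it was written to very recently.
void CloseLargest(std::map<std::string, OpenFile>& open_files) {
  auto largest = open_files.begin();
  for (auto it = open_files.begin(); it != open_files.end(); ++it) {
    if (it->second.bytes_written > largest->second.bytes_written) largest = it;
  }
  std::cout << "closing largest file: " << largest->second.path << "\n";
  open_files.erase(largest);
}

// What the docs describe: close the file that was written to least recently,
// so stale partitions are flushed and hot partitions stay open.
void CloseLeastRecentlyUsed(std::map<std::string, OpenFile>& open_files) {
  auto lru = open_files.begin();
  for (auto it = open_files.begin(); it != open_files.end(); ++it) {
    if (it->second.last_used < lru->second.last_used) lru = it;
  }
  std::cout << "closing LRU file: " << lru->second.path << "\n";
  open_files.erase(lru);
}

int main() {
  std::map<std::string, OpenFile> open_files = {
      {"part=a", {"part=a", /*bytes_written=*/500, /*last_used=*/10}},  // large but hot
      {"part=b", {"part=b", /*bytes_written=*/50, /*last_used=*/1}},    // small but stale
      {"part=c", {"part=c", /*bytes_written=*/200, /*last_used=*/12}},  // recent
  };
  CloseLargest(open_files);            // evicts part=a, the hot partition
  CloseLeastRecentlyUsed(open_files);  // evicts part=b, the stale partition
}
```

Under the size-based policy the hot partition `part=a` is evicted first, while the stale `part=b` stays open; the LRU policy would have evicted `part=b` instead, which is the behaviour the documentation promises.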