xWaita opened a new issue, #45038: URL: https://github.com/apache/arrow/issues/45038
### Describe the bug, including details regarding any error messages, version, and platform.

According to the docs of `pyarrow.dataset.write_dataset`, when `max_open_files` is reached, the least recently used file will be closed (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html). However, in the C++ dataset writer, what actually happens is that the largest file is closed (https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/dataset_writer.cc#L656).

For long-running dataset writes with many partitions, this means that over time the set of open files trends towards smaller and smaller files: recently opened files are closed prematurely, while very old open files hang around and stay in memory.

It would probably make sense to bring the C++ behaviour in line with the docs, i.e. close the oldest file rather than the largest one.

### Component(s)

C++, Python
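For illustration, here is a minimal, self-contained sketch of the two eviction strategies being contrasted. The `OpenFile` struct and the function names are hypothetical and are not Arrow's internal API; `CloseLargest` mirrors what the linked C++ code effectively does today, while `CloseLeastRecentlyUsed` matches what the pyarrow docs describe.

```cpp
// Hypothetical illustration (not actual Arrow internals) of the two
// eviction policies: close the largest open file vs. close the least
// recently used one.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct OpenFile {
  std::string path;
  int64_t bytes_written = 0;
  uint64_t last_used = 0;  // monotonically increasing "tick" of the last write
};

// What the C++ dataset writer effectively does today: close the largest file,
// even if it was written to very recently.
void CloseLargest(std::map<std::string, OpenFile>& open_files) {
  auto largest = open_files.begin();
  for (auto it = open_files.begin(); it != open_files.end(); ++it) {
    if (it->second.bytes_written > largest->second.bytes_written) largest = it;
  }
  std::cout << "closing largest file: " << largest->second.path << "\n";
  open_files.erase(largest);
}

// What the docs describe: close the file that was written to least recently,
// so stale partitions are flushed and hot partitions stay open.
void CloseLeastRecentlyUsed(std::map<std::string, OpenFile>& open_files) {
  auto lru = open_files.begin();
  for (auto it = open_files.begin(); it != open_files.end(); ++it) {
    if (it->second.last_used < lru->second.last_used) lru = it;
  }
  std::cout << "closing LRU file: " << lru->second.path << "\n";
  open_files.erase(lru);
}

int main() {
  std::map<std::string, OpenFile> open_files = {
      {"part=a", {"part=a", /*bytes_written=*/500, /*last_used=*/10}},  // large but hot
      {"part=b", {"part=b", /*bytes_written=*/50, /*last_used=*/1}},    // small but stale
      {"part=c", {"part=c", /*bytes_written=*/200, /*last_used=*/12}},  // recent
  };
  CloseLargest(open_files);            // evicts part=a, the hot partition
  CloseLeastRecentlyUsed(open_files);  // evicts part=b, the stale partition
}
```

Under the size-based policy the hot partition `part=a` is evicted first, while the stale `part=b` stays open; the LRU policy would have evicted `part=b` instead, which is the behaviour the documentation promises.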