nbc opened a new issue, #41057: URL: https://github.com/apache/arrow/issues/41057
### Describe the enhancement requested

I'm not sure whether this is a feature request or a bug report, but when using `write_dataset()`, the resulting dataset can end up with very small row groups (`rows_per_group`), causing very poor performance for most queries: at least 20 times slower and 20 times more memory on a large (~10 GB) dataset. Setting `min_rows_per_group` to something around `100000L` fixes the problem. Not all users are aware of the `min_rows_per_group` parameter, so setting a "good" default (if one exists) could help them a great deal. I'm not qualified enough to know whether there are drawbacks.

### Component(s)

R
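For context, a minimal sketch of the workaround described above, assuming the `arrow` R package (the `min_rows_per_group` and `max_rows_per_group` arguments to `write_dataset()` are real; the path and data here are illustrative only):

```r
library(arrow)

# Without min_rows_per_group, incremental or streaming writes can emit
# many tiny row groups, which degrades scan performance on later queries.
# Setting a floor on row-group size avoids this.
write_dataset(
  mtcars,                         # placeholder data frame
  path = "dataset_dir",           # hypothetical output directory
  format = "parquet",
  min_rows_per_group = 100000L,   # the workaround suggested in this issue
  max_rows_per_group = 1000000L   # optional upper bound on group size
)
```

The issue's suggestion is essentially that a default on the order of `100000L` for `min_rows_per_group` would spare users from having to discover this parameter themselves.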