ag1805x opened a new issue, #45645:
URL: https://github.com/apache/arrow/issues/45645

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'm working with 50 Parquet files (~800MB each) and need to perform a grouped summarization in R (group_by(colA, colB, colC)). When using arrow, I run into memory errors (core dumped, bad_alloc). What is the best way to handle data of this size without exhausting memory? The experimental batch processing seemed like an option, but I can't build batches by random subsetting; ideally I would subset by the group_by columns instead. Is this possible?
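   For concreteness, here is a minimal sketch of the two workflows in question; the directory path, the aggregated column (`value`), and the summary function are placeholders, and whether either variant avoids the bad_alloc in this case is exactly the open question:

   ```r
   library(arrow)
   library(dplyr)

   # Lazily scan all Parquet files in the directory; nothing is read into R yet.
   ds <- open_dataset("parquet_dir")

   # Variant 1: push the grouped aggregation down to Arrow's query engine;
   # collect() only materialises the (much smaller) summarised result.
   result <- ds |>
     group_by(colA, colB, colC) |>
     summarise(total = sum(value, na.rm = TRUE)) |>
     collect()

   # Variant 2: batch by the grouping keys instead of random subsets,
   # e.g. one value of colA per iteration, so each collect() stays small.
   keys <- ds |> distinct(colA) |> collect()
   pieces <- lapply(keys$colA, function(k) {
     ds |>
       filter(colA == k) |>
       group_by(colA, colB, colC) |>
       summarise(total = sum(value, na.rm = TRUE)) |>
       collect()
   })
   result_chunked <- bind_rows(pieces)
   ```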
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
