michaelgaunt404 opened a new issue, #40255:
URL: https://github.com/apache/arrow/issues/40255

   ### Describe the usage question you have. Please include as many useful
details as possible.
   
   
   Is there a tidyr::unnest equivalent for Arrow datasets with multiple Parquet 
files? 
   
   I need to handle close to a hundred terabytes of Parquet files. Each file 
has an attribute with nested tables, and within these tables, there's another 
attribute containing OpenStreetMap IDs that require filtering. I need to 
cross-reference these IDs with attributes from another index. If it were a flat 
file or a long "tidy" data frame, it wouldn't be an issue, but the nested 
structure is complicating matters with the Arrow dataset object.
   
   Currently I use an iterative approach: I load individual Parquet files
into memory, filter them, and save them back out (in practice I do this in
parallel across the available cores on my machine). However, I've come across
Arrow datasets, and the ability to lazily define operations before loading the
data could greatly improve speed.
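For reference, a minimal sketch of my current per-file approach. The column names (`segments` for the nested list-column, `osm_id` inside it), the paths, and the ID index are all hypothetical placeholders for my actual data:

```r
library(arrow)
library(dplyr)
library(tidyr)
library(future)
library(furrr)

plan(multisession)  # parallelize across the available cores

files <- list.files("input_dir", pattern = "\\.parquet$", full.names = TRUE)
keep_ids <- readRDS("osm_ids_to_keep.rds")  # hypothetical index of OSM IDs

future_walk(files, function(f) {
  read_parquet(f) |>                 # load one file fully into memory
    unnest(segments) |>              # flatten the nested tables
    filter(osm_id %in% keep_ids) |>  # cross-reference against the index
    write_parquet(file.path("output_dir", basename(f)))
})
```

This works, but every file is read fully into memory before any filtering happens, which is what I'm hoping a lazy Arrow dataset query could avoid.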
   
   See the images below for reference on the data I'm working with.
   
   
![image](https://github.com/apache/arrow/assets/60335544/90d1ff7f-d83a-40c3-b344-20587496c6e5)
   
   
![image](https://github.com/apache/arrow/assets/60335544/e769ae66-0762-4a37-b355-186b162c8042)
   
   
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
