michaelgaunt404 opened a new issue, #40255:
URL: https://github.com/apache/arrow/issues/40255

   ### Describe the usage question you have. Please include as many useful
details as possible.
   
   
   Is there a tidyr::unnest equivalent for Arrow datasets with multiple Parquet 
files? 
   
   I need to handle close to a hundred terabytes of Parquet files. Each file 
has an attribute with nested tables, and within these tables, there's another 
attribute containing OpenStreetMap IDs that require filtering. I need to 
cross-reference these IDs with attributes from another index. If it were a flat 
file or a long "tidy" data frame, it wouldn't be an issue, but the nested 
structure is complicating matters with the Arrow dataset object.
   
   Currently I use an iterative approach: I load individual Parquet files
into memory, filter them, and save them back out (in practice I do this in
parallel across the available cores on my machine). However, I've come across
Arrow datasets, and the ability to lazily define operations before loading the
data could greatly improve speed.
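For reference, a minimal sketch of my current per-file approach. The column names (`segments` for the nested list-column, `osm_id` inside it), the paths, and the ID index are all hypothetical placeholders for my actual data:

```r
library(arrow)
library(dplyr)
library(tidyr)
library(future)
library(furrr)

plan(multisession)  # parallelize across the available cores

files <- list.files("input_dir", pattern = "\\.parquet$", full.names = TRUE)
keep_ids <- readRDS("osm_ids_to_keep.rds")  # hypothetical index of OSM IDs

future_walk(files, function(f) {
  read_parquet(f) |>                 # load one file fully into memory
    unnest(segments) |>              # flatten the nested tables
    filter(osm_id %in% keep_ids) |>  # cross-reference against the index
    write_parquet(file.path("output_dir", basename(f)))
})
```

This works, but every file is read fully into memory before any filtering happens, which is what I'm hoping a lazy Arrow dataset query could avoid.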
   
   See the images below for reference on the data I'm working with.
   
   
![image](https://github.com/apache/arrow/assets/60335544/90d1ff7f-d83a-40c3-b344-20587496c6e5)
   
   
![image](https://github.com/apache/arrow/assets/60335544/e769ae66-0762-4a37-b355-186b162c8042)
   
   
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
