JakeRuss opened a new issue, #44889:
URL: https://github.com/apache/arrow/issues/44889

   ### Describe the enhancement requested
   
   I have a dataset hosted on AWS S3 as hive-partitioned Parquet files. The data is written to S3 by a Python job via pandas with snappy compression, and the resulting file names look like this:
   ````
   s3://bucket-name/dataset/year=2024/month=11/day=25/hour_ending=10/a0866a30008848e1ba878514954e4d6a.snappy.parquet
   ````
   For our use case, the Python job runs every hour and writes out new hourly data. Because the most recent 24 hours of data may receive updates, those 24 hourly partitions are also rewritten on each run. In the example above, every time the hour_ending=10 file is rewritten, it gets a new random string in its file name.
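
   To make the write pattern concrete, here is a minimal sketch of the kind of job described above, assuming pandas with the pyarrow engine (and s3fs for the S3 path); the bucket and column names come from the example path, and the exact basename scheme varies by engine and version:
   ````
   import pandas as pd

   # Hypothetical stand-in for one hour of data; the real job rebuilds
   # the most recent 24 hours on every run.
   df = pd.DataFrame({
       "year": [2024], "month": [11], "day": [25], "hour_ending": [10],
       "value": [1.0],
   })

   # Hive-partitioned, snappy-compressed write. Each run drops a fresh,
   # randomly named file into the partition directory instead of
   # overwriting the previous file in place.
   df.to_parquet(
       "s3://bucket-name/dataset/",
       engine="pyarrow",
       compression="snappy",
       partition_cols=["year", "month", "day", "hour_ending"],
   )
   ````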
   
   The problem I am running into: after I `open_dataset("s3://bucket-name/dataset/")` in R and then query this dataset, depending on my timing I sometimes hit an error where arrow looks for a file name that no longer exists (because the Python job updated the most recent 24-hour partitions after I opened the dataset but before the query could finish). I get this error:
   ````
   Error in `compute.arrow_dplyr_query()`:
   ! IOError: Could not open Parquet input source 'bucket-name/dataset/year=2024/month=11/day=25/hour_ending=10/5d7020e501db412b9e729bb0b5da948b.snappy.parquet': AWS Error NO_SUCH_KEY during GetObject operation: The specified key does not exist.
   ````
   Would it be possible (advisable?) to add an option that allows reading whatever Parquet file is currently in a partition, regardless of whether the file name has changed mid-query?
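
   Absent such an option, the best stopgap I can see is to re-run discovery and retry the query, which narrows the window but cannot close it. A rough sketch, again in Python, with a made-up helper name and retry policy:
   ````
   import time

   import pyarrow.compute as pc
   import pyarrow.dataset as ds

   def scan_with_retry(uri, filter_expr, attempts=3, delay=5.0):
       """Hypothetical stopgap: re-discover the dataset and retry when a
       file disappears mid-scan. The discovery/scan race remains."""
       for attempt in range(attempts):
           dataset = ds.dataset(uri, format="parquet", partitioning="hive")
           try:
               return dataset.to_table(filter=filter_expr)
           except OSError:  # pyarrow's ArrowIOError subclasses OSError
               if attempt == attempts - 1:
                   raise
               time.sleep(delay)

   table = scan_with_retry(
       "s3://bucket-name/dataset/", pc.field("hour_ending") == 10
   )
   ````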
   
   
   ### Component(s)
   
   R

