JakeRuss opened a new issue, #44889: URL: https://github.com/apache/arrow/issues/44889
### Describe the enhancement requested

I have a dataset hosted on AWS S3 as hive-partitioned parquet files. The data is written to S3 by a Python job via pandas with snappy compression, and the resulting file names look like this:

````
s3://bucket-name/dataset/year=2024/month=11/day=25/hour_ending=10/a0866a30008848e1ba878514954e4d6a.snappy.parquet
````

For our use case the Python job runs each hour and writes out the new hourly data, but since the most recent 24 hours might have updates, the last 24 hours of partitions are also rewritten each hour. In the example above, each time the hour-ending-10 file is rewritten, it gets a new random string as part of the snappy file name.

The problem I am running into is that after I `open_dataset("s3://bucket-name/dataset/")` in R and then query this dataset, depending on my timing, I sometimes hit an error where arrow is looking for a file name that no longer exists (because the Python job rewrote the most recent 24 hours of partitions after I opened the dataset but before the query finished). I get this error:

````
Error in `compute.arrow_dplyr_query()`:
! IOError: Could not open Parquet input source 'bucket-name/dataset/year=2024/month=11/day=25/hour_ending=10/5d7020e501db412b9e729bb0b5da948b.snappy.parquet': AWS Error NO_SUCH_KEY during GetObject operation: The specified key does not exist.
````

Would it be possible (advisable?) to add an option that allows reading whatever parquet file is in a partition, regardless of whether the parquet file name has changed mid-query?

### Component(s)

R
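For reference, here is a minimal R sketch of the read pattern that hits this error. The bucket/dataset names, partition filter values, and the summarised column are placeholders for illustration, not the actual query.

````r
library(arrow)
library(dplyr)

# Open the hive-partitioned dataset on S3; arrow discovers the parquet files
# (and their random file names) at this point.
ds <- open_dataset("s3://bucket-name/dataset/")

# If the hourly Python job rewrites the most recent partitions between
# open_dataset() and collect(), the scan can fail with NO_SUCH_KEY because
# the discovered file names no longer exist.
result <- ds |>
  filter(year == 2024, month == 11, day == 25) |>
  group_by(hour_ending) |>
  summarise(n_rows = n()) |>
  collect()
````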