debrouwere opened a new issue, #44725:
URL: https://github.com/apache/arrow/issues/44725

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I have a hive-style partitioned Parquet dataset in which each partition consists of a single file, `part-0.parquet`. When I filter down to a single partition using the dplyr interface, `open_dataset` still ends up unnecessarily accessing 10 to 15 files, which leads to unexpectedly slow loads. The filtering itself is correct; it's the performance I'm concerned about.
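   
   For reference, the on-disk layout looks roughly like this (a sketch: partitions other than `cycle=2022`/`country=Belgium` are illustrative):
   
   ```r
   # Sketch of the dataset layout, listed via fs::dir_tree(); partition
   # values other than cycle=2022/country=Belgium are hypothetical.
   fs::dir_tree('build/pisa.rx')
   #> build/pisa.rx
   #> ├── cycle=2022
   #> │   ├── country=Belgium
   #> │   │   └── part-0.parquet
   #> │   ├── country=France
   #> │   │   └── part-0.parquet
   #> │   └── ...
   #> └── ...
   ```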
   
   I am using R 4.4.1 on macOS (Intel) with a precompiled build of Arrow, version 17.0.0.1.
   
   This takes about 20 seconds for me:
   
   ```r
   library(arrow)
   library(dplyr)  # for the dataset verbs and starts_with() used below
   
   since <- Sys.time()
   assessments <- read_parquet(
     'build/pisa.rx/cycle=2022/country=Belgium/part-0.parquet',
     col_select = starts_with('w_')
   )
   until <- Sys.time()
   until - since
   ```
   
   whereas this takes about 50 seconds:
   
   ```r
   since <- Sys.time()
   assessments <- open_dataset('build/pisa.rx') |>
     filter(country == 'Belgium', cycle == 2022) |>
     select(starts_with('w_')) |>
     collect()
   until <- Sys.time()
   until - since
   ```
   
   and even this takes 40 seconds:
   
   ```r
   # reuse the schema of the data collected above
   rx_schema <- assessments |> schema()
   
   since <- Sys.time()
   assessments <- open_dataset('build/pisa.rx',
                               hive_style = TRUE,
                               partitioning = partitioning,  # defined earlier, not shown
                               unify_schemas = FALSE,
                               format = 'parquet',
                               schema = rx_schema) |>
     filter(country == 'Belgium', cycle == 2022) |>
     select(starts_with('w_')) |>
     collect()
   until <- Sys.time()
   until - since
   ```
   
   and in fact *not* filtering down to a single partition seems to be faster, at about 35 seconds, even though it reads roughly 8 times as much data:
   
   ```r
   since <- Sys.time()
   assessments <- open_dataset('build/pisa.rx') |>
     filter(country == 'Belgium') |>
     select(starts_with('w_')) |>
     collect()
   until <- Sys.time()
   until - since
   ```
   
   I know `open_dataset` is reading in so many files because a single call to `open_dataset` produces about 10 or more (innocuous) "invalid metadata$r" errors (see https://github.com/apache/arrow/issues/40423). Possibly it is trying to unify the schemas by peeking into these other files, even though `unify_schemas = FALSE`?
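   
   To separate discovery cost from scan cost, here is a diagnostic sketch (`$files` on a `FileSystemDataset` lists every file the dataset factory discovered, before any filtering):
   
   ```r
   library(arrow)
   
   # Time discovery alone: open_dataset() walks the directory tree and,
   # when no schema is supplied, reads file footers to infer one.
   since <- Sys.time()
   ds <- open_dataset('build/pisa.rx')
   Sys.time() - since
   
   # Every file the dataset knows about, regardless of later filters:
   length(ds$files)
   head(ds$files)
   ```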
   
   I realize that `open_dataset` has some overhead relative to `read_parquet` because it has to walk the directory structure and so on, but if I filter down to a single partition, surely `open_dataset` should only access that particular partition?
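   
   As a workaround sketch (assuming the hive paths shown above), pointing `open_dataset` at the partition directory itself avoids listing or opening the sibling partitions:
   
   ```r
   library(arrow)
   library(dplyr)
   
   since <- Sys.time()
   # Open only the one partition directory. Note that cycle and country
   # are no longer derived as partition columns here and would need to
   # be added back manually if required.
   assessments <- open_dataset('build/pisa.rx/cycle=2022/country=Belgium') |>
     select(starts_with('w_')) |>
     collect()
   Sys.time() - since
   ```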
   
   I'm not sure, but I don't think "invalid metadata$r" itself is the problem: although that metadata contains the schema, as you can see above I have also tried loading the data with a valid schema pre-specified, and it was still slow.
   
   ### Component(s)
   
   R

