Invest the time to save as parquet. CSV is slow even though Arrow is faster at reading it than read.csv.
If you can process a bunch of cases at once with vectorized calculations that will also speed things up... but not all algorithms are vectorizable. On May 24, 2026 5:19:36 PM PDT, Naresh Gurbuxani <[email protected]> wrote: >Files are on a local network drive. > >I ended up creating a duckdb database and writing all the data into a couple >of tables. Database is approximately 200 GB. > >Initially I was directly reading these files one at a time, doing the >analysis, keeping analysis results, then moving to next file. Going through >all the files took a few hours. Then, if I wanted to tweak the analysis, I >needed to start over. > >I am looking for tools to get faster access to data files, preferably without >resaving data. Some analysis requires a small subset of data. If all the >data were in a few files, then in memory duckdb would work. There would be no >need to resave data. But with so many files, writing data into duckdb >database was needed. > >My analysis is mostly complete. For next time, I want to see if arrow + >duckdb will help avoid resaving data in another format. > >Sent from my iPhone > >On May 24, 2026, at 5:41 PM, John Kane <[email protected]> wrote: > > >I am not really sure what you are doing here. > >Where are the files stored? Are they in one place? >What size are they? > >On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani ><[email protected]<mailto:[email protected]>> wrote: > >I have approximately ten thousand csv files with identical columns and >formats. I want to run some SQL queries on a virtual database, where all >of these files are treated as one table. While it is possible to run >SQL query, dbListTables() does not show this table. Is it possible to >list all tables including those created from arrow FileSystem? > >Is it possible to achieve this result without arrow package? > ># Create example data >library(data.table) >data("flights", package = "nycflights13") >fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv") >fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv") >fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv") > >data("airports", package = "nycflights13") >fwrite(airports, "data/airports.csv") > ># Verify data saved as intended >dir("data") >[1] "airports.csv" "flights" >dir("data/flights/") >[1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv" > ># Create virtual database with two tables >library(arrow) >library(duckdb) > ># csv file successfully registed as a table >con <- dbConnect(duckdb()) >duckdb_read_csv(con, "airports", "data/airports.csv") >dbListTables(con) >[1] "airports" > ># flights_arrow does not show up as a table >flights_arrow <- open_csv_dataset("data/flights") >duckdb_register_arrow(con, "flights", flights_arrow) >dbListTables(con) >[1] "airports" >dbGetQuery(con, "SELECT table_name FROM information_schema.tables;") > table_name >1 airports > ># SQL queries can be run on flights table >dbGetQuery(con, "SELECT * FROM flights LIMIT 2;") > year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time > arr_delay carrier >1 2013 1 1 517 515 2 830 819 > 11 UA >2 2013 1 1 554 558 -4 740 728 > 12 UA > flight tailnum origin dest air_time distance hour minute time_hour >1 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 10:00:00 >2 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 10:00:00 > >______________________________________________ >[email protected]<mailto:[email protected]> mailing list -- To >UNSUBSCRIBE and more, see >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide https://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code. > > >-- >John Kane >Kingston ON Canada > > [[alternative HTML version deleted]] > >______________________________________________ >[email protected] mailing list -- To UNSUBSCRIBE and more, see >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide https://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code. -- Sent from my phone. Please excuse my brevity. [[alternative HTML version deleted]] ______________________________________________ [email protected] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide https://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

