Files are on a local network drive. I ended up creating a duckdb database and writing all the data into a couple of tables. Database is approximately 200 GB.
Initially I was directly reading these files one at a time, doing the analysis, keeping analysis results, then moving to next file. Going through all the files took a few hours. Then, if I wanted to tweak the analysis, I needed to start over. I am looking for tools to get faster access to data files, preferably without resaving data. Some analysis requires a small subset of data. If all the data were in a few files, then in memory duckdb would work. There would be no need to resave data. But with so many files, writing data into duckdb database was needed. My analysis is mostly complete. For next time, I want to see if arrow + duckdb will help avoid resaving data in another format. Sent from my iPhone On May 24, 2026, at 5:41 PM, John Kane <[email protected]> wrote: I am not really sure what you are doing here. Where are the files stored? Are they in one place? What size are they? On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani <[email protected]<mailto:[email protected]>> wrote: I have approximately ten thousand csv files with identical columns and formats. I want to run some SQL queries on a virtual database, where all of these files are treated as one table. While it is possible to run SQL query, dbListTables() does not show this table. Is it possible to list all tables including those created from arrow FileSystem? Is it possible to achieve this result without arrow package? # Create example data library(data.table) data("flights", package = "nycflights13") fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv") fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv") fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv") data("airports", package = "nycflights13") fwrite(airports, "data/airports.csv") # Verify data saved as intended dir("data") [1] "airports.csv" "flights" dir("data/flights/") [1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv" # Create virtual database with two tables library(arrow) library(duckdb) # csv file successfully registed as a table con <- dbConnect(duckdb()) duckdb_read_csv(con, "airports", "data/airports.csv") dbListTables(con) [1] "airports" # flights_arrow does not show up as a table flights_arrow <- open_csv_dataset("data/flights") duckdb_register_arrow(con, "flights", flights_arrow) dbListTables(con) [1] "airports" dbGetQuery(con, "SELECT table_name FROM information_schema.tables;") table_name 1 airports # SQL queries can be run on flights table dbGetQuery(con, "SELECT * FROM flights LIMIT 2;") year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier 1 2013 1 1 517 515 2 830 819 11 UA 2 2013 1 1 554 558 -4 740 728 12 UA flight tailnum origin dest air_time distance hour minute time_hour 1 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 10:00:00 2 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 10:00:00 ______________________________________________ [email protected]<mailto:[email protected]> mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide https://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- John Kane Kingston ON Canada [[alternative HTML version deleted]] ______________________________________________ [email protected] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide https://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

