On 5/25/26 04:46, Naresh Gurbuxani wrote:


" If all the data were in a few files, then in memory duckdb would work."

I only need a subset of data at any time.  Duckdb allows a virtual table for 
each file.  This is not practical with thousands of files.  With a few large 
files, this can work.  Here the goal is to establish a connection, not to load 
all data at once.

If the files have the same columns, you can also open all files as one virtual table using duckdb. The code below creates a view called 'flights' over the data from all csv files in data/.

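con <- DBI::dbConnect(duckdb::duckdb())

# View over all csv files below data/ (recursive glob)
sql <- paste0("CREATE OR REPLACE VIEW flights AS ",
  "SELECT * FROM read_csv('data/**/*.csv');")
DBI::dbExecute(con, sql)

DBI::dbListTables(con)

# Note: this scans every file and returns all rows to R
DBI::dbGetQuery(con, "SELECT * FROM flights;")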
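
Since only a subset is needed at any time, the filtering and aggregation can be pushed into the query so that only the small result is returned to R. Below is a minimal sketch of that idea. The temporary directory, the file names and the columns 'carrier' and 'dep_delay' are made up for illustration and will differ from the real data.

```r
library(DBI)

# Tiny stand-in for data/ in a temporary directory (hypothetical data)
data_dir <- file.path(tempdir(), "flights_demo")
dir.create(file.path(data_dir, "2024"), recursive = TRUE, showWarnings = FALSE)
data_dir <- normalizePath(data_dir, winslash = "/")
write.csv(data.frame(carrier = c("AA", "UA"), dep_delay = c(5, 20)),
          file.path(data_dir, "2024", "jan.csv"), row.names = FALSE)
write.csv(data.frame(carrier = c("AA", "DL"), dep_delay = c(15, 0)),
          file.path(data_dir, "2024", "feb.csv"), row.names = FALSE)

con <- dbConnect(duckdb::duckdb())

# Same kind of view as in the code above, pointed at the temporary directory
dbExecute(con, sprintf(
  "CREATE OR REPLACE VIEW flights AS SELECT * FROM read_csv('%s/**/*.csv');",
  data_dir))

# Filter and aggregate inside duckdb; only the summary row comes back to R
res <- dbGetQuery(con, "
  SELECT carrier, count(*) AS n, avg(dep_delay) AS mean_delay
  FROM flights
  WHERE carrier = 'AA'
  GROUP BY carrier;")
print(res)

dbDisconnect(con, shutdown = TRUE)
```

With the real files, the WHERE clause would use whatever columns identify the subset of interest.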


duckdb is fast and runs queries in parallel, but for every query it has to go through all the csv files again. Going through 200GB of data will take time. So, if you have to query the data repeatedly, resaving the data in another format will probably speed up your code significantly.
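
One possible way to resave the data is to let duckdb write a Parquet file once and then point the view at that file. Parquet is my assumption here as an example of "another format"; a persistent duckdb database file would be another option. The sketch below uses small made-up csv files in a temporary directory, and the file and column names are hypothetical.

```r
library(DBI)

# Tiny stand-in csv files in a temporary directory (hypothetical data)
work_dir <- normalizePath(tempdir(), winslash = "/")
csv_dir <- file.path(work_dir, "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, value = c(1.5, 2.5, 3.5)),
          file.path(csv_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 4:5, value = c(4.5, 5.5)),
          file.path(csv_dir, "b.csv"), row.names = FALSE)
parquet_file <- file.path(work_dir, "flights_demo.parquet")

con <- dbConnect(duckdb::duckdb())

# One-time conversion: scan all csv files once, write a single Parquet file
dbExecute(con, sprintf(
  "COPY (SELECT * FROM read_csv('%s/*.csv')) TO '%s' (FORMAT PARQUET);",
  csv_dir, parquet_file))

# Later queries go through a view on the Parquet file instead of the csv files
dbExecute(con, sprintf(
  "CREATE OR REPLACE VIEW flights AS SELECT * FROM read_parquet('%s');",
  parquet_file))

total <- dbGetQuery(con, "SELECT count(*) AS n, sum(value) AS total FROM flights;")
print(total)

dbDisconnect(con, shutdown = TRUE)
```

The conversion itself still has to read all csv files once, so it only pays off when the data is queried more than a few times.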

HTH,

Jan

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
