Naresh, There are times the choice of tools constrains you.
Consider what would happen if you chose a language with features readily and cheaply available that allow some kinds of concurrent programming. Python allows some of that in a somewhat interpreted way and plenty of more compiled languages may do it even more efficiently. I am talking about the ability to have a program hand out work to multiple other threads on the same machine or even distribute the work across a network of machines. In your case, consider how some search engines do a search across a wide swath of the internet. Loosely, a master program sends requests to many others and gets back results and in some sense merges them. Another example is a merge sort where at every step, one function hands half of the remaining data to each of two others to sort then waits for a result and merges the two returned streams together. This may happen using thousands of processes eventually and may return a fasgter result than many other sort methods. In your case, you could have a queue of N simultaneous threads that each process one file for you and return results. As soon as one finishes, another thread starts to do the next file. The threads will likely interweave when they are doing slow things like I/O and may run on different processors and the interleaving can be substantially faster than one process running and spending much of the time waiting for I/O and so on. I don't know if R currently supports some of this well. I have seen Python used this way and many modules are available that support things like this. Of course, you might have a learning process involved as you might need to use add-on modules like numpy and pandas rather than what is largely built-in using R and its many packages. R does have some packages that may let you do this, such as "parallel" but, again, I suspect a learning curve to get it right. I think I can sign off now as I suspect much of my thoughts may not be of help. 😉 Avi -----Original Message----- From: Naresh Gurbuxani <[email protected]> Sent: Sunday, May 24, 2026 8:26 PM To: [email protected] Cc: John Kane <[email protected]>; [email protected] Subject: Re: [R] duckdb table from multiple csv files I have done most of the analysis using tools in base R: read one file, do analysis, keep results, move to next file. Each cycle is several hours. Making small changes in analysis is costly. I want to explore methods that will provide faster access to data, but do not require resaving data. Sent from my iPhone > On May 24, 2026, at 6:33 PM, [email protected] wrote: > > Although there is some code there, we may want to step back and ask what the > person wants to do. > > Assuming you have thousands of files in CSV format and you can easily access > them one by one, then there are two major methods that seem interesting if > you want to evaluate them in some way. > > You can open all the files and combine them into one entity you can query. > > Alternatively, you can create a query that you then do repeatedly on as many > (or all) of the files till you get a (combined) result. > > The query can use native R mechanisms, or a package that does SQL if you put > the data into a database, or another package such as in the tidyverse using > dplyr and so on. > > In real life, you have to ask if the many CSV files contain valid data > without duplications or missing data and so on. If it ios not guaranteed to > be exactly what you want, then consolidating the data and running multiple > functions on it to make it what you want may be important. Removing (or > maybe fixing) some rows may make sense. It may make sense to remove columns > not needed for the query or rows that fail some condition that simplifies > later queries. It may make sense to choose subsets, remove duplicates on the > columns remaining and much more. As one of many examples, you might > determine info about some column and then remove outliers above and below a > threshold. > > If you insist on using SQL, as the final query, you could push all the > remaining data into a database. > > If your data is too large when combined, there are packages that can help > you work with them albeit this is an example where pushing it all into a > database first makes sense, albeit it can be slow. > > And note, your solution can include work outside of R such as copying all > the files (assuming n o header in each) into one file and then continuing > from R. > > Good luck. > > > -----Original Message----- > From: R-help <[email protected]> On Behalf Of John Kane > Sent: Sunday, May 24, 2026 5:41 PM > To: Naresh Gurbuxani <[email protected]> > Cc: [email protected] > Subject: Re: [R] duckdb table from multiple csv files > > I am not really sure what you are doing here. > > Where are the files stored? Are they in one place? > What size are they? > >> On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani >> <[email protected]> >> wrote: >> >> >> I have approximately ten thousand csv files with identical columns and >> formats. I want to run some SQL queries on a virtual database, where all >> of these files are treated as one table. While it is possible to run >> SQL query, dbListTables() does not show this table. Is it possible to >> list all tables including those created from arrow FileSystem? >> >> Is it possible to achieve this result without arrow package? >> >> # Create example data >> library(data.table) >> data("flights", package = "nycflights13") >> fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv") >> fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv") >> fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv") >> >> data("airports", package = "nycflights13") >> fwrite(airports, "data/airports.csv") >> >> # Verify data saved as intended >> dir("data") >> [1] "airports.csv" "flights" >> dir("data/flights/") >> [1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv" >> >> # Create virtual database with two tables >> library(arrow) >> library(duckdb) >> >> # csv file successfully registed as a table >> con <- dbConnect(duckdb()) >> duckdb_read_csv(con, "airports", "data/airports.csv") >> dbListTables(con) >> [1] "airports" >> >> # flights_arrow does not show up as a table >> flights_arrow <- open_csv_dataset("data/flights") >> duckdb_register_arrow(con, "flights", flights_arrow) >> dbListTables(con) >> [1] "airports" >> dbGetQuery(con, "SELECT table_name FROM information_schema.tables;") >> table_name >> 1 airports >> >> # SQL queries can be run on flights table >> dbGetQuery(con, "SELECT * FROM flights LIMIT 2;") >> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time >> arr_delay carrier >> 1 2013 1 1 517 515 2 830 >> 819 11 UA >> 2 2013 1 1 554 558 -4 740 >> 728 12 UA >> flight tailnum origin dest air_time distance hour minute >> time_hour >> 1 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 >> 10:00:00 >> 2 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 >> 10:00:00 >> >> ______________________________________________ >> [email protected] mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> https://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > > -- > John Kane > Kingston ON Canada > > [[alternative HTML version deleted]] > > ______________________________________________ > [email protected] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > https://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ [email protected] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide https://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

