Sent from my iPhone > On May 24, 2026, at 9:52 PM, [email protected] wrote: > > John, > > 200 GB is indeed a big amount of data as many machines have much less working > memory albeit they may be able to save some virtual memory on disk if > properly configured. > > I have not used the tools you mention but wonder if your underlying data > files change or are stable? > > If they mostly do not change, one possible approach is to have your program > save the list of filenames as well as the last changed date. When the program > runs again, it can check for the existence of new files or absence of old > files fairly quickly in memory. Perhaps it can then only deal with changed or > new files and try to update their data to the DB carefully (avoiding > duplicates) and thus speed up one part of your effort. Of course, this may > not handle what to do with data that is gone or if your data allows redundant > lines. > > Ideally, whatever generates data should not be saving the files at all but > deal directly with the DB. Or, if the CSV files contained a column specifying > the date the data was added, you could use that to determine updates. > > You said: > > " If all the data were in a few files, then in memory duckdb would work. " > I only need a subset of data at any time. Duckdb allows a virtual table for each file. This not practical with thousands of files. With a few large files, this can work. Here the goal is to establish a connection, not to load all data at once.
With arrow, it is possible to designate a directory as virtual table. > I wonder about that as 200GB is a huge amount. Do thousands of files take up > more space than a few big ones? > > Obviously, opening and closing lots of files is slower. Some operations do > take a long time or use extra space so how you concatenate the data matters. > On something like a LINUX system, there are fairly efficient ways to > concatenate lots of files such as issuing a command within some folder that > looks like: > > cat *.csv >subfolder/bigfile.csv > > Again, if all share the same columns but have no header, this can run fairly > quickly outside of R and then you read in bigfile.csv and later delete it. I > note when not in use, files like that can be kept compressed and some methods > even allow reading them in a compressed format, at some expense. If all you > need is a random sample, I can imagine ways to read in less data. > > And, ask yourself if the data in memory can be compressed in another way. As > an example, R supports a data structure called factors for some kinds of > data. If you have a column that stores something like the name of the US > State, making it a factor as it is read in may require storing North Carolina > as number 22 and so on. TRUE/FALSE in a column might be replaced with a > Boolean value of 0/1. There are other such techniques. This could be done > carefully so the final in-memory data structure is smaller, and I noted > earlier what happens if you simply remove many columns, and perhaps rows, as > you read the data in, and remove temporary variables as soon as possible. > Lots of people end up with many variations on their data remaining in memory > and overwhelming any machine. > > I am no expert, but perhaps any future such work would do better being > designed up-front in ways that place the burden more on a data-base designed > to handle large amounts of data rather than files on your machine. Going > forward, many projects involve constantly adding data to the point where it > has to be distributed across a cloud of machines in order to work at all. > > > > -----Original Message----- > From: R-help <[email protected]> On Behalf Of Naresh Gurbuxani > Sent: Sunday, May 24, 2026 8:20 PM > To: John Kane <[email protected]> > Cc: [email protected] > Subject: Re: [R] duckdb table from multiple csv files > > Files are on a local network drive. > > I ended up creating a duckdb database and writing all the data into a couple > of tables. Database is approximately 200 GB. > > Initially I was directly reading these files one at a time, doing the > analysis, keeping analysis results, then moving to next file. Going through > all the files took a few hours. Then, if I wanted to tweak the analysis, I > needed to start over. > > I am looking for tools to get faster access to data files, preferably without > resaving data. Some analysis requires a small subset of data. If all the > data were in a few files, then in memory duckdb would work. There would be > no need to resave data. But with so many files, writing data into duckdb > database was needed. > > My analysis is mostly complete. For next time, I want to see if arrow + > duckdb will help avoid resaving data in another format. > > Sent from my iPhone > > On May 24, 2026, at 5:41 PM, John Kane <[email protected]> wrote: > > > I am not really sure what you are doing here. > > Where are the files stored? Are they in one place? > What size are they? > > On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani > <[email protected]<mailto:[email protected]>> wrote: > > I have approximately ten thousand csv files with identical columns and > formats. I want to run some SQL queries on a virtual database, where all > of these files are treated as one table. While it is possible to run > SQL query, dbListTables() does not show this table. Is it possible to > list all tables including those created from arrow FileSystem? > > Is it possible to achieve this result without arrow package? > > # Create example data > library(data.table) > data("flights", package = "nycflights13") > fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv") > fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv") > fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv") > > data("airports", package = "nycflights13") > fwrite(airports, "data/airports.csv") > > # Verify data saved as intended > dir("data") > [1] "airports.csv" "flights" > dir("data/flights/") > [1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv" > > # Create virtual database with two tables > library(arrow) > library(duckdb) > > # csv file successfully registed as a table > con <- dbConnect(duckdb()) > duckdb_read_csv(con, "airports", "data/airports.csv") > dbListTables(con) > [1] "airports" > > # flights_arrow does not show up as a table > flights_arrow <- open_csv_dataset("data/flights") > duckdb_register_arrow(con, "flights", flights_arrow) > dbListTables(con) > [1] "airports" > dbGetQuery(con, "SELECT table_name FROM information_schema.tables;") > table_name > 1 airports > > # SQL queries can be run on flights table > dbGetQuery(con, "SELECT * FROM flights LIMIT 2;") > year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time > arr_delay carrier > 1 2013 1 1 517 515 2 830 819 > 11 UA > 2 2013 1 1 554 558 -4 740 728 > 12 UA > flight tailnum origin dest air_time distance hour minute time_hour > 1 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 10:00:00 > 2 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 10:00:00 > > ______________________________________________ > [email protected]<mailto:[email protected]> mailing list -- To > UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide https://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > > -- > John Kane > Kingston ON Canada > > [[alternative HTML version deleted]] > > ______________________________________________ > [email protected] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide https://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ [email protected] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide https://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

