I have done most of the analysis using tools in base R: read one file, do 
analysis, keep results, move to next file.  Each cycle is several hours.  
Making small changes in analysis is costly.

I want to explore methods that will provide faster access to data, but do not 
require resaving data.


Sent from my iPhone

> On May 24, 2026, at 6:33 PM, [email protected] wrote:
> 
> Although there is some code there, we may want to step back and ask what the
> person wants to do.
> 
> Assuming you have thousands of files in CSV format and you can easily access
> them one by one, then there are two major methods that seem interesting if
> you want to evaluate them in some way.
> 
> You can open all the files and combine them into one entity you can query.
> 
> Alternatively, you can create a query that you then do repeatedly on as many
> (or all) of the files till you get a (combined) result.
> 
> The query can use native R mechanisms, or a package that does SQL if you put
> the data into a database, or another package such as in the tidyverse using
> dplyr and so on.
> 
> In real life, you have to ask if the many CSV files contain valid data
> without duplications or missing data and so on. If it ios not guaranteed to
> be exactly what you want, then consolidating the data and running multiple
> functions on it to make it what you want may be important. Removing (or
> maybe fixing) some rows may make sense. It may make sense to remove columns
> not needed for the query or rows that fail some condition that simplifies
> later queries. It may make sense to choose subsets, remove duplicates on the
> columns remaining and much more. As one of many examples, you might
> determine info about some column and then remove outliers above and below a
> threshold.
> 
> If you insist on using SQL, as the final query, you could push all the
> remaining data into a database.    
> 
> If your data is too large when combined, there are packages that can help
> you work with them albeit this is an example where pushing it all into a
> database first makes sense, albeit it can be slow.
> 
> And note, your solution can include work outside of R such as copying all
> the files (assuming n o header in each) into one file and then continuing
> from R.
> 
> Good luck.
> 
> 
> -----Original Message-----
> From: R-help <[email protected]> On Behalf Of John Kane
> Sent: Sunday, May 24, 2026 5:41 PM
> To: Naresh Gurbuxani <[email protected]>
> Cc: [email protected]
> Subject: Re: [R] duckdb table from multiple csv files
> 
> I am not really sure what you are doing here.
> 
> Where are the files stored? Are they in one place?
> What size are they?
> 
>> On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani
>> <[email protected]>
>> wrote:
>> 
>> 
>> I have approximately ten thousand csv files with identical columns and
>> formats.  I want to run some SQL queries on a virtual database, where all
>> of these files are treated as one table.  While it is possible to run
>> SQL query, dbListTables() does not show this table.  Is it possible to
>> list all tables including those created from arrow FileSystem?
>> 
>> Is it possible to achieve this result without arrow package?
>> 
>> # Create example data
>> library(data.table)
>> data("flights", package = "nycflights13")
>> fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv")
>> fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv")
>> fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv")
>> 
>> data("airports", package = "nycflights13")
>> fwrite(airports, "data/airports.csv")
>> 
>> # Verify data saved as intended
>> dir("data")
>> [1] "airports.csv" "flights"
>> dir("data/flights/")
>> [1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv"
>> 
>> # Create virtual database with two tables
>> library(arrow)
>> library(duckdb)
>> 
>> # csv file successfully registed as a table
>> con <- dbConnect(duckdb())
>> duckdb_read_csv(con, "airports", "data/airports.csv")
>> dbListTables(con)
>> [1] "airports"
>> 
>> # flights_arrow does not show up as a table
>> flights_arrow <- open_csv_dataset("data/flights")
>> duckdb_register_arrow(con, "flights", flights_arrow)
>> dbListTables(con)
>> [1] "airports"
>> dbGetQuery(con, "SELECT table_name FROM information_schema.tables;")
>>  table_name
>> 1   airports
>> 
>> # SQL queries can be run on flights table
>> dbGetQuery(con, "SELECT * FROM flights LIMIT 2;")
>>  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
>> arr_delay carrier
>> 1 2013     1   1      517            515         2      830
>> 819        11      UA
>> 2 2013     1   1      554            558        -4      740
>> 728        12      UA
>>  flight tailnum origin dest air_time distance hour minute
>> time_hour
>> 1   1545  N14228    EWR  IAH      227     1400    5     15 2013-01-01
>> 10:00:00
>> 2   1696  N39463    EWR  ORD      150      719    5     58 2013-01-01
>> 10:00:00
>> 
>> ______________________________________________
>> [email protected] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> https://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 
> 
> 
> --
> John Kane
> Kingston ON Canada
> 
>    [[alternative HTML version deleted]]
> 
> ______________________________________________
> [email protected] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to