Invest the time to convert the files to Parquet. CSV is slow to read, even 
though Arrow reads it faster than read.csv does.
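
A minimal sketch of that one-time conversion, using the arrow package. The 
two small CSV files and the directory names (csv_demo, parquet_demo) are 
stand-ins made up for illustration; with real data you would point 
open_csv_dataset() at the existing folder of CSV files instead.

```r
library(arrow)

# Stand-in data: two small CSV files with identical columns (hypothetical)
csv_dir <- file.path(tempdir(), "csv_demo")
pq_dir  <- file.path(tempdir(), "parquet_demo")  # hypothetical output folder
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, x = c(0.5, 1.5, 2.5)),
          file.path(csv_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 4:6, x = c(3.5, 4.5, 5.5)),
          file.path(csv_dir, "b.csv"), row.names = FALSE)

# One-time cost: scan every CSV file and rewrite the data as Parquet
csv_ds <- open_csv_dataset(csv_dir)
write_dataset(csv_ds, pq_dir, format = "parquet")

# Later sessions open the Parquet copy and skip CSV parsing entirely
pq_ds <- open_dataset(pq_dir)
pq_ds$num_rows
```

The Parquet dataset can then be passed to duckdb_register_arrow() in the same 
way as the CSV dataset in the quoted example.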

If you can process many cases at once with vectorized calculations, that 
will also speed things up... but not all algorithms are vectorizable.
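
As a rough illustration of what I mean by vectorizing, here is a toy 
calculation done both ways. The numbers and the speed formula are made up 
for the example and are not from your analysis.

```r
# Hypothetical per-row calculation: average speed in miles per hour
distance <- c(1400, 719, 1089)  # miles
air_time <- c(227, 150, 160)    # minutes

# One case at a time
speed_loop <- numeric(length(distance))
for (i in seq_along(distance)) {
  speed_loop[i] <- distance[i] / (air_time[i] / 60)
}

# All cases at once: the arithmetic is applied to the whole vectors
speed_vec <- distance / (air_time / 60)
```

Both give the same result; the second form avoids the interpreted loop, and 
that is where the time goes once there are many rows.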

On May 24, 2026 5:19:36 PM PDT, Naresh Gurbuxani <[email protected]> 
wrote:
>Files are on a local network drive.
>
>I ended up creating a duckdb database and writing all the data into a couple 
>of tables.  Database is approximately 200 GB.
>
>Initially I was directly reading these files one at a time, doing the 
>analysis, keeping analysis results, then moving to next file.  Going through 
>all the files took a few hours.  Then, if I wanted to tweak the analysis, I 
>needed to start over.
>
>I am looking for tools to get faster access to data files, preferably without 
>resaving data.  Some analysis requires a small subset of data.  If all the 
>data were in a few files, then in memory duckdb would work.  There would be no 
>need to resave data.  But with so many files, writing data into duckdb 
>database was needed.
>
>My analysis is mostly complete.  For next time, I want to see if arrow + 
>duckdb will help avoid resaving data in another format.
>
>On May 24, 2026, at 5:41 PM, John Kane <[email protected]> wrote:
>
>
>I am not really sure what you are doing here.
>
>Where are the files stored? Are they in one place?
>What size are they?
>
>On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani 
><[email protected]> wrote:
>
>I have approximately ten thousand csv files with identical columns and
>formats.  I want to run some SQL queries on a virtual database, where all
>of these files are treated as one table.  While it is possible to run
>SQL query, dbListTables() does not show this table.  Is it possible to
>list all tables including those created from arrow FileSystem?
>
>Is it possible to achieve this result without arrow package?
>
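># Create example data
>library(data.table)
>data("flights", package = "nycflights13")
># flights is loaded as a tibble; convert it so data.table subsetting works
>flights <- as.data.table(flights)
># fwrite() does not create missing directories
>dir.create("data/flights", recursive = TRUE, showWarnings = FALSE)
>fwrite(flights[origin == "EWR"], "data/flights/ewr_flights.csv")
>fwrite(flights[origin == "JFK"], "data/flights/jfk_flights.csv")
>fwrite(flights[origin == "LGA"], "data/flights/lga_flights.csv")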
>
>data("airports", package = "nycflights13")
>fwrite(airports, "data/airports.csv")
>
># Verify data saved as intended
>dir("data")
>[1] "airports.csv" "flights"
>dir("data/flights/")
>[1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv"
>
># Create virtual database with two tables
>library(arrow)
>library(duckdb)
>
># csv file successfully registered as a table
>con <- dbConnect(duckdb())
>duckdb_read_csv(con, "airports", "data/airports.csv")
>dbListTables(con)
>[1] "airports"
>
># flights_arrow does not show up as a table
>flights_arrow <- open_csv_dataset("data/flights")
>duckdb_register_arrow(con, "flights", flights_arrow)
>dbListTables(con)
>[1] "airports"
>dbGetQuery(con, "SELECT table_name FROM information_schema.tables;")
>  table_name
>1   airports
>
># SQL queries can be run on flights table
>dbGetQuery(con, "SELECT * FROM flights LIMIT 2;")
>  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
>1 2013     1   1      517            515         2      830            819        11      UA
>2 2013     1   1      554            558        -4      740            728        12      UA
>  flight tailnum origin dest air_time distance hour minute           time_hour
>1   1545  N14228    EWR  IAH      227     1400    5     15 2013-01-01 10:00:00
>2   1696  N39463    EWR  ORD      150      719    5     58 2013-01-01 10:00:00
>
>______________________________________________
>[email protected] mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>
>
>--
>John Kane
>Kingston ON Canada
>

--
Sent from my phone. Please excuse my brevity.