Here is something to try. It relies on a feature that is currently only in
the development version of xan (https://github.com/medialab/xan),
so you will need to build xan from source using cargo.
cat("cat rows --paths - | filter 'origin eq `JFK`'", file = "jfk.xan")
dat <- fread(cmd = "ls *_flights.csv | xan run -f jfk.xan -")
On Mon, May 25, 2026 at 7:46 AM Jan van der Laan <[email protected]> wrote:
>
> On 5/25/26 04:46, Naresh Gurbuxani wrote:
>
> >>
> >> " If all the data were in a few files, then in memory duckdb would work."
> >>
> > I only need a subset of data at any time. Duckdb allows a virtual table
> > for each file. This is not practical with thousands of files. With a few
> > large files, this can work. Here the goal is to establish a connection,
> > not to load all data at once.
>
> If the files have the same columns, you can also open all files
> into one virtual database using duckdb. The code below creates a virtual
> table view called 'flights' with the data from all csv files in data/.
>
> con <- duckdb::dbConnect(duckdb::duckdb())
>
> sql <- paste0("CREATE OR REPLACE VIEW flights AS ",
>                "SELECT * FROM read_csv('data/**/*.csv');")
> DBI::dbExecute(con, sql)
>
> DBI::dbListTables(con)
>
> DBI::dbGetQuery(con, "SELECT * FROM flights;")
>
>
> duckdb is fast and will do things in parallel, but for every query it
> will have to go through all files. Going through 200GB of data will take
> time. So, if you have to query the data repeatedly it is probably going
> to speed up your code significantly if you resave your data in another
> format.
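
To illustrate the resaving idea above, here is a minimal sketch that assumes Parquet as the target format (Jan did not name one). It reads the CSV files once, writes a single Parquet file with duckdb's COPY statement, and then runs later queries against that file. The two tiny CSV files, the column names and the file name flights.parquet are made up for the example.

```r
library(DBI)

# Hypothetical stand-in for the real data/ tree: two tiny CSV files
# with identical columns, written to a temporary directory.
csv_dir <- file.path(normalizePath(tempdir(), winslash = "/"), "data")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(origin = c("JFK", "LGA"), dep_delay = c(5, -2)),
          file.path(csv_dir, "a_flights.csv"), row.names = FALSE)
write.csv(data.frame(origin = c("EWR", "JFK"), dep_delay = c(12, 0)),
          file.path(csv_dir, "b_flights.csv"), row.names = FALSE)

con <- dbConnect(duckdb::duckdb())

# One-time conversion: scan every CSV once and write one Parquet file.
parquet_file <- file.path(dirname(csv_dir), "flights.parquet")
dbExecute(con, sprintf(
  "COPY (SELECT * FROM read_csv('%s/*.csv')) TO '%s' (FORMAT parquet);",
  csv_dir, parquet_file))

# Later queries read the Parquet file and never touch the CSV files.
jfk <- dbGetQuery(con, sprintf(
  "SELECT * FROM read_parquet('%s') WHERE origin = 'JFK';",
  parquet_file))

dbDisconnect(con, shutdown = TRUE)
```

With the real data the conversion still has to read all of the CSV files once, but repeated subset queries afterwards run against the Parquet file only.
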
>
> HTH,
>
> Jan
>
> ______________________________________________
> [email protected] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com