I have approximately ten thousand csv files with identical columns and
formats. I want to run some SQL queries on a virtual database, where all
of these files are treated as one table. While it is possible to run
SQL query, dbListTables() does not show this table. Is it possible to
list all tables including those created from arrow FileSystem?
Is it possible to achieve this result without arrow package?
# Create example data
library(data.table)
data("flights", package = "nycflights13")
fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv")
fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv")
fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv")
data("airports", package = "nycflights13")
fwrite(airports, "data/airports.csv")
# Verify data saved as intended
dir("data")
[1] "airports.csv" "flights"
dir("data/flights/")
[1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv"
# Create virtual database with two tables
library(arrow)
library(duckdb)
# csv file successfully registed as a table
con <- dbConnect(duckdb())
duckdb_read_csv(con, "airports", "data/airports.csv")
dbListTables(con)
[1] "airports"
# flights_arrow does not show up as a table
flights_arrow <- open_csv_dataset("data/flights")
duckdb_register_arrow(con, "flights", flights_arrow)
dbListTables(con)
[1] "airports"
dbGetQuery(con, "SELECT table_name FROM information_schema.tables;")
table_name
1 airports
# SQL queries can be run on flights table
dbGetQuery(con, "SELECT * FROM flights LIMIT 2;")
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
arr_delay carrier
1 2013 1 1 517 515 2 830 819
11 UA
2 2013 1 1 554 558 -4 740 728
12 UA
flight tailnum origin dest air_time distance hour minute time_hour
1 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 10:00:00
2 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 10:00:00
______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.