danielballan opened a new issue, #2066: URL: https://github.com/apache/arrow-adbc/issues/2066
### What would you like help with?

Hello @paleolimbot, we briefly spoke at SciPy a month ago. @cpcloud and @gforsyth sent me your way.

My colleague @skarakuzu and I have the following application:

```
Science Experiment ---> ~10Hz rows of data over HTTP ---> Database
```

In our use case, we are interested in reading partial datasets while an experiment is still ongoing, but we are not especially sensitive to real-time access; a little lag is acceptable. We would like to use Postgres for facility-scale "production" and SQLite for small deployments and dev/test. We think it makes sense to use ADBC to get the data into Postgres and SQLite.

Most of our data is simple tabular data with basic types. This is a simple sketch that works:

```python
import pandas
import pyarrow
import adbc_driver_sqlite.dbapi

conn = adbc_driver_sqlite.dbapi.connect("test.sqlite")

with conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS example")

dfs = [
    pandas.DataFrame({"A": [1, 2, 3], "B": [10., 20., 30.]}),
    pandas.DataFrame({"A": [4, 5, 6], "B": [40., 50., 60.]}),
]

# Create a table and append to it.
with conn.cursor() as cur:
    for df in dfs:
        table = pyarrow.Table.from_pandas(df)
        cur.adbc_ingest("example", table, mode="create_append")
    cur.execute("SELECT * FROM example")
    print(cur.fetchall())

conn.close()
```

In rare cases, we have detectors that produce a small "waveform" of data, a small variable-length list, alongside some readings, like this:

```python
dfs = [
    pandas.DataFrame({"A": [[1, 2, 3], [1, 2], [1, 2, 3, 4]], "B": [10., 20., 30.]}),
    pandas.DataFrame({"A": [[4, 5, 6], [4, 5, 6, 7], [4]], "B": [40., 50., 60.]}),
]
```

Expand for a complete runnable example with this data:

<details>

```python
import pandas
import pyarrow
import adbc_driver_sqlite.dbapi

conn = adbc_driver_sqlite.dbapi.connect("test.sqlite")

with conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS example")

dfs = [
    pandas.DataFrame({"A": [[1, 2, 3], [1, 2], [1, 2, 3, 4]], "B": [10., 20., 30.]}),
    pandas.DataFrame({"A": [[4, 5, 6], [4, 5, 6, 7], [4]], "B": [40., 50., 60.]}),
]

# Create a table and append to it.
with conn.cursor() as cur:
    for df in dfs:
        table = pyarrow.Table.from_pandas(df)
        cur.adbc_ingest("example", table, mode="create_append")
    cur.execute("SELECT * FROM example")
    print(cur.fetchall())

conn.close()
```

</details>

As you would expect, this fails with a clear error message:

```
adbc_driver_manager.NotSupportedError: NOT_IMPLEMENTED: Column 0 has unsupported type list
```

It looks like support for lists was recently added for Postgres in https://github.com/apache/arrow-adbc/pull/1962 but not yet released. We have tried without success to cobble together a C++ example inserting lists into PostgreSQL on `main`. I wonder if you could sketch a short runnable example to help us get started. (Let us know if you want to see what we've tried.) We would be happy to contribute to the C++ documentation for ADBC once we get something working.

Once #1962 is released, should we expect lists to work from Python as well, since the Python driver wraps the C/C++, or are more changes needed on the Python side to make that work?

Finally, we're interested in adding support for lists in SQLite. As SQLite is loose about types, perhaps one reasonable way to do this would be with a JSON column. I wonder what you think of that approach.
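To make the JSON idea concrete, here is a minimal sketch of the application-side workaround we have in mind in the meantime. It assumes nothing beyond the public `adbc_ingest` API: the variable-length list column is encoded as JSON text before ingesting and decoded again after reading. (Purely illustrative; a driver-level feature would presumably handle this more transparently.)

```python
import json

import pandas
import pyarrow
import adbc_driver_sqlite.dbapi

conn = adbc_driver_sqlite.dbapi.connect("test.sqlite")

with conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS example")

dfs = [
    pandas.DataFrame({"A": [[1, 2, 3], [1, 2], [1, 2, 3, 4]], "B": [10., 20., 30.]}),
    pandas.DataFrame({"A": [[4, 5, 6], [4, 5, 6, 7], [4]], "B": [40., 50., 60.]}),
]

with conn.cursor() as cur:
    for df in dfs:
        # Encode the list column as JSON text so the SQLite driver
        # only sees a plain string column.
        encoded = df.assign(A=df["A"].map(json.dumps))
        table = pyarrow.Table.from_pandas(encoded)
        cur.adbc_ingest("example", table, mode="create_append")
    cur.execute("SELECT * FROM example")
    # Decode the JSON text back into Python lists on the way out.
    print([(json.loads(a), b) for a, b in cur.fetchall()])

conn.close()
```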