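[I] c/driver/postgresql: adbc_ingest silently misaligns list/large_list/fixed_size_list rows when the source Arrow array is sliced (parent.offset > 0) [arrow-adbc]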

via GitHub Mon, 18 May 2026 03:29:10 -0700


mediprtl opened a new issue, #4319:
URL: https://github.com/apache/arrow-adbc/issues/4319


   ### What happened?
   
   `PostgresCopyListFieldWriter::Write` (`c/driver/postgresql/copy/writer.h`, 
both the `IsFixedSize` and variable-length branches) computes the child range 
for each row from the *logical* row index without adding `array_view_->offset`. 
When the parent `List` / `LargeList` / `FixedSizeList` array has `offset > 0` 
(a sliced parent), the writer reads the wrong slot of the offsets buffer — or, 
for fixed-size, multiplies the wrong base index by the element size. The 
resulting child ranges index into the full, unsliced child values buffer, so 
list elements end up attached to the wrong rows.
   
   Practical impact: silent, per-row drift of list-column values when an Arrow 
table is sliced into multiple batches and ingested via `adbc_ingest` with 
`mode="create"` then `mode="append"`. The first chunk (`offset=0`) is always 
correct; every subsequent chunk's list/array column is shifted by the chunk's 
`parent.offset`. Scalar columns are unaffected because their writers route 
through `ArrowArrayViewGetInt*`, which honors `offset`.
   
   Reproduced end-to-end on the `postgres-test` service from this repo's 
`compose.yaml` against `adbc-driver-postgresql` 1.11.0, with pyarrow 23 and 24, 
for `list<string>`, `large_list<string>`, and `fixed_size_list<string, 2>`.
   
   ### Stack Trace
   
   No exception — silent data corruption.
   
   ### How can we reproduce the bug?
   
   `docker compose up --detach --wait postgres-test`, then:
   
   ```python
   import pyarrow as pa
   from adbc_driver_postgresql import dbapi
   
   URI = "postgresql://postgres:password@localhost:5432/postgres"
   N, batch = 6, 3
   
   def expected(i):
       # variable length so any drift breaks structure, not just values
       return [f"r{i}-a", f"r{i}-b"] if i % 2 == 0 else [f"r{i}-x"]
   
   tbl = pa.table({
       "pk":   pa.array(list(range(N))),
       "tags": pa.array([expected(i) for i in range(N)],
                        type=pa.large_list(pa.string())),
   })
   
   with dbapi.connect(URI) as conn, conn.cursor() as cur:
       cur.execute("DROP TABLE IF EXISTS adbc_list_bug")
       for i, off in enumerate(range(0, N, batch)):
           cur.adbc_ingest("adbc_list_bug", tbl.slice(off, batch),
                           mode="create" if i == 0 else "append")
       cur.execute("SELECT pk, tags FROM adbc_list_bug ORDER BY pk")
       for pk, tags in cur.fetchall():
           print(pk, tags, "OK" if tags == expected(pk) else "DRIFTED")
   ```
   
   Observed: pk 0–2 correct, pk 3–5 drifted. Repeats identically with 
`pa.list_(pa.string())` and `pa.list_(pa.string(), 2)`.
   
   The variable-length structure also drifts: pk=3 (expected 1 element) 
receives the 2-element value from row 0, which pins the diagnosis on "the 
offsets buffer is being misread" rather than "child values are shifted 
independently."
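
The drift pattern can be modelled without pyarrow or the driver. The following is a minimal pure-Python sketch, assuming the sliced parent shares one offsets buffer and one flat child values buffer with the original array, as described in this report. The helpers `read_row` and `ingest` and the `honor_offset` flag are hypothetical names for illustration only; they are not part of the driver or nanoarrow.

```python
# Pure-Python model of a sliced variable-length list array.
# All names here are illustrative, not driver or nanoarrow APIs.

def expected(i):
    # Same shape as the repro: even rows have 2 elements, odd rows have 1.
    return [f"r{i}-a", f"r{i}-b"] if i % 2 == 0 else [f"r{i}-x"]

N, BATCH = 6, 3

# Buffers of the unsliced parent: flat child values plus N + 1 offsets.
values, offsets = [], [0]
for i in range(N):
    values.extend(expected(i))
    offsets.append(len(values))

def read_row(index, parent_offset, honor_offset):
    """Return the list for logical row `index` of a slice at `parent_offset`."""
    slot = index + parent_offset if honor_offset else index
    return values[offsets[slot]:offsets[slot + 1]]

def ingest(honor_offset):
    """Simulate writing each BATCH-sized slice; returns {pk: tags}."""
    out = {}
    for parent_offset in range(0, N, BATCH):
        for index in range(BATCH):
            pk = parent_offset + index
            out[pk] = read_row(index, parent_offset, honor_offset)
    return out

buggy = ingest(honor_offset=False)  # models the current writer
fixed = ingest(honor_offset=True)   # models adding the parent offset

for pk in range(N):
    print(pk, buggy[pk], "OK" if buggy[pk] == expected(pk) else "DRIFTED")
```

In this model the second slice re-reads offset slots 0 through 2, so pk=3 receives row 0's two-element list, matching the observation above.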
   
   ### Root cause
   
   `c/driver/postgresql/copy/writer.h`, `PostgresCopyListFieldWriter::Write` 
(both template branches) use the logical `index` directly:
   
   ```cpp
   if constexpr (IsFixedSize) {
     start = index * array_view_->layout.child_size_elements;
     end   = start + array_view_->layout.child_size_elements;
   } else {
     start = ArrowArrayViewListChildOffset(array_view_, index);
     end   = ArrowArrayViewListChildOffset(array_view_, index + 1);
   }
   ```
   
   `ArrowArrayViewListChildOffset` (nanoarrow `inline_array.h`) is, unlike its 
sibling `ArrowArrayViewGetIntUnsafe`, *not* offset-aware — it reads 
`buffer_views[1].data.as_int32[i]` (or `as_int64`) directly. And the fixed-size 
branch never references `offset` either. So both branches misbehave when 
`array_view_->offset > 0`.
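
The fixed-size branch's arithmetic can be checked by hand. The sketch below is a pure-Python illustration, assuming a child size of 2 (as in `fixed_size_list<string, 2>`) and a slice that starts at row 3; `CHILD_SIZE`, `child_range`, and the value labels are made up for the example and do not come from the driver.

```python
# Illustrative model of the fixed-size branch; all names are hypothetical.
CHILD_SIZE = 2  # elements per row

# Flat child values for 6 rows, shared by every slice of the parent.
values = [f"r{row}-{k}" for row in range(6) for k in range(CHILD_SIZE)]

def child_range(index, parent_offset=0):
    """Child [start, end) for logical row `index` of a slice at `parent_offset`."""
    start = (parent_offset + index) * CHILD_SIZE
    return start, start + CHILD_SIZE

# First row (index 0) of a slice that begins at row 3 of the parent.
buggy_start, buggy_end = child_range(0)                   # offset ignored
fixed_start, fixed_end = child_range(0, parent_offset=3)  # offset applied

print(values[buggy_start:buggy_end])  # ['r0-0', 'r0-1']
print(values[fixed_start:fixed_end])  # ['r3-0', 'r3-1']
```

Ignoring the offset yields the child range of row 0 instead of row 3, which is the same shift by `parent.offset` seen in the variable-length case.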
   
   PyArrow's `Table.slice(off, len)` produces parent `List` / `FixedSizeList` 
arrays with `array.offset = off`, sharing the original offsets/child buffers, 
so any multi-batch `adbc_ingest` path (or any caller passing a sliced source) 
trips this.
   
   ### Suggested fix
   
   ```cpp
   // `index` is the logical row; adding the parent offset gives the physical slot.
   const int64_t physical = array_view_->offset + index;
   if constexpr (IsFixedSize) {
     start = physical * array_view_->layout.child_size_elements;
     end   = start + array_view_->layout.child_size_elements;
   } else {
     start = ArrowArrayViewListChildOffset(array_view_, physical);
     end   = ArrowArrayViewListChildOffset(array_view_, physical + 1);
   }
   ```
   
   With that change built locally and the resulting `.so` swapped into the 
otherwise unmodified wheel, all three list types ingest correctly across 
multi-chunk slices, in the same venv where the unpatched wheel drifts.
   
   ### Workaround (driver-user side)
   
   Pass non-sliced inputs only. `Table.combine_chunks()` and 
`pa.concat_tables([sliced])` are **not** sufficient — they short-circuit for a 
single-chunk slice and preserve `offset > 0`. Per-column 
`ChunkedArray.combine_chunks()` (or `Table.from_arrays([c.combine_chunks() for 
c in t.columns], names=…)`) does materialize and reset offsets to 0.
   
   ### Environment/Setup
   
   - `adbc-driver-postgresql` 1.11.0 (also reproduces on `main`)
   - pyarrow 23.0.1 and 24.0.0
   - macOS arm64; the postgres-test container from this repo's `compose.yaml`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] c/driver/postgresql: adbc_ingest silently misaligns list/large_list/fixed_size_list rows when the source Arrow array is sliced (parent.offset > 0) [arrow-adbc]
