mediprtl opened a new issue, #4319:
URL: https://github.com/apache/arrow-adbc/issues/4319
### What happened?
`PostgresCopyListFieldWriter::Write` (`c/driver/postgresql/copy/writer.h`,
both the `IsFixedSize` and variable-length branches) computes the child range
for each row from the *logical* row index without adding `array_view_->offset`.
When the parent `List` / `LargeList` / `FixedSizeList` array has `offset > 0`
(a sliced parent), the writer reads the wrong slot of the offsets buffer or,
for fixed-size lists, multiplies the wrong base index by the list size. The
resulting child ranges are still valid indices into the unsliced child values
buffer, so nothing fails; list elements are simply attached to the wrong rows.
Practical impact: silent, per-row drift of list-column values when an Arrow
table is sliced into multiple batches and ingested via `adbc_ingest` with
`mode="create"` then `mode="append"`. The first chunk (`offset=0`) is always
correct; every subsequent chunk's list/array column is shifted by the chunk's
`parent.offset`. Scalar columns are unaffected because their writers route
through `ArrowArrayViewGetInt*`, which honors `offset`.
Reproduced end-to-end on the `postgres-test` service from this repo's
`compose.yaml` against `adbc-driver-postgresql` 1.11.0, with pyarrow 23 and 24,
for `list<string>`, `large_list<string>`, and `fixed_size_list<string, 2>`.
### Stack Trace
No exception — silent data corruption.
### How can we reproduce the bug?
`docker compose up --detach --wait postgres-test`, then:
```python
import pyarrow as pa
from adbc_driver_postgresql import dbapi

URI = "postgresql://postgres:password@localhost:5432/postgres"
N, batch = 6, 3

def expected(i):
    # variable length so any drift breaks structure, not just values
    return [f"r{i}-a", f"r{i}-b"] if i % 2 == 0 else [f"r{i}-x"]

tbl = pa.table({
    "pk": pa.array(list(range(N))),
    "tags": pa.array([expected(i) for i in range(N)],
                     type=pa.large_list(pa.string())),
})

with dbapi.connect(URI) as conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS adbc_list_bug")
    for i, off in enumerate(range(0, N, batch)):
        cur.adbc_ingest("adbc_list_bug", tbl.slice(off, batch),
                        mode="create" if i == 0 else "append")
    cur.execute("SELECT pk, tags FROM adbc_list_bug ORDER BY pk")
    for pk, tags in cur.fetchall():
        print(pk, tags, "OK" if tags == expected(pk) else "DRIFTED")
```
Observed: pk 0–2 correct, pk 3–5 drifted. Repeats identically with
`pa.list_(pa.string())` and `pa.list_(pa.string(), 2)`.
The list structure drifts too, not just the values: pk=3 (expected 1 element)
receives the 2-element value from row 0. That points to the offsets buffer
being misread rather than the child values being shifted independently.
### Root cause
`c/driver/postgresql/copy/writer.h`, `PostgresCopyListFieldWriter::Write`
(both template branches) use the logical `index` directly:
```cpp
if constexpr (IsFixedSize) {
  start = index * array_view_->layout.child_size_elements;
  end = start + array_view_->layout.child_size_elements;
} else {
  start = ArrowArrayViewListChildOffset(array_view_, index);
  end = ArrowArrayViewListChildOffset(array_view_, index + 1);
}
```
`ArrowArrayViewListChildOffset` (nanoarrow `inline_array.h`) is, unlike its
sibling `ArrowArrayViewGetIntUnsafe`, *not* offset-aware — it reads
`buffer_views[1].data.as_int32[i]` (or `as_int64`) directly. And the fixed-size
branch never references `offset` either. So both branches misbehave when
`array_view_->offset > 0`.
PyArrow's `Table.slice(off, len)` produces parent `List` / `FixedSizeList`
arrays with `array.offset = off`, sharing the original offsets/child buffers,
so any multi-batch `adbc_ingest` path (or any caller passing a sliced source)
trips this.
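The effect can be modelled without the driver at all. The following is a minimal pure-Python sketch, not the driver's code: `values`, `offsets`, and the three reader functions are hypothetical stand-ins for the child values buffer, the shared offsets buffer, and the two branches of `Write`. It uses the same six rows as the reproduction and a slice with `offset=3`.

```python
# Toy model of a sliced list array. All names are illustrative, not nanoarrow API.
def expected(i):
    return [f"r{i}-a", f"r{i}-b"] if i % 2 == 0 else [f"r{i}-x"]

N = 6
values = []     # flattened child values, shared by every slice
offsets = [0]   # N + 1 entries, also shared by every slice
for i in range(N):
    values.extend(expected(i))
    offsets.append(len(values))

def read_row_buggy(index):
    # Uses the logical row index directly, ignoring the slice offset.
    return values[offsets[index]:offsets[index + 1]]

def read_row_fixed(index, array_offset):
    # Adds the slice offset before touching the offsets buffer.
    physical = array_offset + index
    return values[offsets[physical]:offsets[physical + 1]]

def fixed_size_range(index, array_offset, list_size):
    # Fixed-size lists have no offsets buffer; the range is computed instead.
    start = (array_offset + index) * list_size
    return start, start + list_size

array_offset, length = 3, 3   # second batch of tbl.slice(off, batch)
for index in range(length):
    pk = array_offset + index
    print(pk, read_row_buggy(index), read_row_fixed(index, array_offset))
```

In this model the buggy reader returns row 0's two elements for pk=3, which matches the drift observed above, while the offset-aware reader returns `expected(3)`.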
### Suggested fix
```cpp
const int64_t physical = array_view_->offset + index;
if constexpr (IsFixedSize) {
  start = physical * array_view_->layout.child_size_elements;
  end = start + array_view_->layout.child_size_elements;
} else {
  start = ArrowArrayViewListChildOffset(array_view_, physical);
  end = ArrowArrayViewListChildOffset(array_view_, physical + 1);
}
```
With the driver built locally with that change and the resulting `.so` swapped
into the unmodified wheel, all three list types ingest correctly across
multi-chunk slices, in the same venv where the unpatched wheel drifts.
### Workaround (driver-user side)
Pass non-sliced inputs only. `Table.combine_chunks()` and
`pa.concat_tables([sliced])` are **not** sufficient — they short-circuit for a
single-chunk slice and preserve `offset > 0`. Per-column
`ChunkedArray.combine_chunks()` (or `Table.from_arrays([c.combine_chunks() for
c in t.columns], names=…)`) does materialize and reset offsets to 0.
### Environment/Setup
- `adbc-driver-postgresql` 1.11.0 (also reproduces on `main`)
- pyarrow 23.0.1 and 24.0.0
- macOS arm64; the postgres-test container from this repo's `compose.yaml`