jeremybarner opened a new pull request, #1183:
URL: https://github.com/apache/iceberg-go/pull/1183
## Problem
`processPositionalDeletes` applies positional (merge-on-read) deletes one
Arrow record batch at a time, but `combinePositionalDeletes` builds the
surviving-row indices in **global, file-relative** coordinates `[start, end)`
and passes them straight into `compute.Take` against the **current batch**,
whose valid index range is `[0, NumRows)`.
For the first batch of a data file, `start == 0`, so the indices happen to be
valid. For the second and later batches, `start > 0`, so every index is
`>= NumRows` and the scan fails:
```
index error: <N> out of bounds
```
where `<N>` is a multiple of the parquet read batch size
(`read.parquet.batch-size`, default `1<<17 == 131072`). So any data file larger
than one batch that also has positional delete files fails to scan at its
second batch.
## Fix
Rebase the Take indices to batch-local coordinates (`i - start`). The
`deletes` set stays in global coordinates because it is matched against the
global position `i`.
```go
for i := start; i < end; i++ {
	if _, ok := deletes[i]; !ok {
		bldr.Append(i - start)
	}
}
```
## Test
Adds `TestProcessPositionalDeletesAcrossBatches`, which feeds two
consecutive batches with a delete located in the **second** batch — the case
the previous code got wrong. The test fails with `index error: 4 out of bounds`
before the fix and passes after.
🤖 Generated with [Claude Code](https://claude.com/claude-code)