jbbqqf opened a new pull request, #16264: URL: https://github.com/apache/iceberg/pull/16264
## Summary `NullabilityHolder.reset()` previously zeroed `numNulls` but left the `isNull` marker array unchanged. Callers that reuse a holder across batches could see `hasNulls() == false` yet still get `isNullAt(i) == 1` for indices set null in a previous batch. This PR fixes `reset()` to also clear the `isNull` array, and adds focused unit tests covering the single-set, bulk-set, and idempotency paths. Resolves #15808 ## Context Today's vectorized read hot-path callers (e.g. the arrow vectorized readers) always overwrite all positions on the next batch via `setNulls(start, num)` / `setNotNulls(start, num)`, so the latent stale state is never observed in production. The bug is therefore a contract trap rather than an active correctness bug — but the holder's contract documents `reset()` as restoring a clean initial state, and the original triage in #15808 already proposed the one-line fix. This PR is a revival of #15809 (closed by stale-bot at 30 days of no review). The diff is unchanged in spirit; this version adds AI disclosure (which a reviewer asked for on the previous attempt) and a slightly broader test class to make the regression-test evidence explicit. ## Changes - `arrow/src/main/java/org/apache/iceberg/arrow/vectorized/NullabilityHolder.java` — `reset()` now also calls `Arrays.fill(isNull, (byte) 0)`. A short comment explains the why (so a reviewer reading the diff cold doesn't have to re-derive it from the issue). - `arrow/src/test/java/org/apache/iceberg/arrow/vectorized/TestNullabilityHolder.java` — new test class with three tests: `resetClearsIsNullMarkers`, `resetClearsBulkSetNulls`, `resetIsIdempotent`. All three fail on `origin/main` and pass with the fix. The other two arrays in `NullabilityHolder` (`nulls` and `nonNulls`) are immutable `arraycopy` source buffers and intentionally **not** cleared — that's documented in the inline comment. ## Reproduce BEFORE/AFTER yourself (copy-paste) ```bash # --- one-time setup --- git clone https://github.com/apache/iceberg.git /tmp/repro && cd /tmp/repro # --- BEFORE (origin/main) --- git checkout origin/main # Pull in only the new test file from the fix branch so the regression # is visible against unmodified production code: git fetch https://github.com/jbbqqf/iceberg.git feat/15808-fix-nullability-holder-reset git checkout FETCH_HEAD -- arrow/src/test/java/org/apache/iceberg/arrow/vectorized/TestNullabilityHolder.java ./gradlew :iceberg-arrow:test --tests "org.apache.iceberg.arrow.vectorized.TestNullabilityHolder" # Expected: 3 tests, 3 failures — isNullAt() returns the stale 1 from # a setNull/setNulls call that preceded reset(). # --- AFTER (this PR) --- git checkout FETCH_HEAD ./gradlew :iceberg-arrow:test --tests "org.apache.iceberg.arrow.vectorized.TestNullabilityHolder" # Expected: 3 tests, 3 passes — reset() now zeros isNull as well as numNulls. ``` The only thing that changes between BEFORE and AFTER is the checked-out ref of the production class; the test file is the same in both runs. ## What I ran locally - `./gradlew :iceberg-arrow:test --tests "org.apache.iceberg.arrow.vectorized.TestNullabilityHolder"` on `origin/main` with the new test file imported → **3 tests, 3 failures** (regression visible). - Same command on this branch → **3 tests, 3 passed**. - `./gradlew :iceberg-arrow:spotlessCheck` → **PASS**. - Full `:iceberg-arrow:test` build wrapping the change → **BUILD SUCCESSFUL** (existing tests untouched). ## Edge cases tested | # | Scenario | Input | Expected | Verified by | |---|----------|-------|----------|-------------| | 1 | Single `setNull(idx)` then `reset()` | `setNull(3)`, `reset()` | `isNullAt(3) == 0`, all other indices == 0, `numNulls() == 0` | `resetClearsIsNullMarkers` | | 2 | Bulk `setNulls(start, num)` then `reset()` | `setNulls(2, 4)`, `reset()` | every index == 0, `hasNulls() == false` | `resetClearsBulkSetNulls` | | 3 | Idempotency — call `reset()` twice | `setNull(0)`, `reset()`, `reset()` | every index == 0, `numNulls() == 0` | `resetIsIdempotent` | ## Risk / blast radius Internal: `NullabilityHolder` is package-private under `org.apache.iceberg.arrow.vectorized` and only used by the vectorized arrow read path. The fix adds a `byte[]` zero-fill in `reset()` (size = batch capacity, typically 4–8 KiB on hot paths). On the dominant hot path the holder is reused across batches and the next bulk `setNulls`/`setNotNulls` would have overwritten the array anyway, so the only observable effect is a single extra `Arrays.fill` per `reset()` — a few microseconds per batch on typical JVM heuristics. Additive change; no API surface modified. ## Release note ```release-note Fix NullabilityHolder.reset() to clear the isNull marker array, restoring the documented clean-state contract for callers that reuse a holder across batches. ``` --- *PR drafted with assistance from Claude Code. The change was reviewed manually against `apache/iceberg`'s source on `main`. The reproducer block above was used during development; it is the same one a reviewer can paste verbatim.* -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
