jbbqqf opened a new pull request, #16264:
URL: https://github.com/apache/iceberg/pull/16264

   ## Summary
   
   `NullabilityHolder.reset()` previously zeroed `numNulls` but left the 
`isNull` marker array unchanged. Callers that reuse a holder across batches 
could see `hasNulls() == false` yet still get `isNullAt(i) == 1` for indices 
set null in a previous batch.
   
   This PR fixes `reset()` to also clear the `isNull` array, and adds focused 
unit tests covering the single-set, bulk-set, and idempotency paths.
   
   Resolves #15808
   
   ## Context
   
   Today's vectorized read hot-path callers (e.g. the arrow vectorized readers) 
always overwrite all positions on the next batch via `setNulls(start, num)` / 
`setNotNulls(start, num)`, so the latent stale state is never observed in 
production. The bug is therefore a contract trap rather than an active 
correctness bug — but the holder's contract documents `reset()` as restoring a 
clean initial state, and the original triage in #15808 already proposed the 
one-line fix.
   
   This PR is a revival of #15809 (closed by stale-bot at 30 days of no 
review). The diff is unchanged in spirit; this version adds AI disclosure 
(which a reviewer asked for on the previous attempt) and a slightly broader 
test class to make the regression-test evidence explicit.
   
   ## Changes
   
   - 
`arrow/src/main/java/org/apache/iceberg/arrow/vectorized/NullabilityHolder.java`
 — `reset()` now also calls `Arrays.fill(isNull, (byte) 0)`. A short comment 
explains the why (so a reviewer reading the diff cold doesn't have to re-derive 
it from the issue).
   - 
`arrow/src/test/java/org/apache/iceberg/arrow/vectorized/TestNullabilityHolder.java`
 — new test class with three tests: `resetClearsIsNullMarkers`, 
`resetClearsBulkSetNulls`, `resetIsIdempotent`. All three fail on `origin/main` 
and pass with the fix.
   
   The other two arrays in `NullabilityHolder` (`nulls` and `nonNulls`) are 
immutable `arraycopy` source buffers and intentionally **not** cleared — that's 
documented in the inline comment.
   
   ## Reproduce BEFORE/AFTER yourself (copy-paste)
   
   ```bash
   # --- one-time setup ---
   git clone https://github.com/apache/iceberg.git /tmp/repro && cd /tmp/repro
   
   # --- BEFORE (origin/main) ---
   git checkout origin/main
   # Pull in only the new test file from the fix branch so the regression
   # is visible against unmodified production code:
   git fetch https://github.com/jbbqqf/iceberg.git 
feat/15808-fix-nullability-holder-reset
   git checkout FETCH_HEAD -- 
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/TestNullabilityHolder.java
   ./gradlew :iceberg-arrow:test --tests 
"org.apache.iceberg.arrow.vectorized.TestNullabilityHolder"
   # Expected: 3 tests, 3 failures — isNullAt() returns the stale 1 from
   #           a setNull/setNulls call that preceded reset().
   
   # --- AFTER (this PR) ---
   git checkout FETCH_HEAD
   ./gradlew :iceberg-arrow:test --tests 
"org.apache.iceberg.arrow.vectorized.TestNullabilityHolder"
   # Expected: 3 tests, 3 passes — reset() now zeros isNull as well as numNulls.
   ```
   
   The only thing that changes between BEFORE and AFTER is the checked-out ref 
of the production class; the test file is the same in both runs.
   
   ## What I ran locally
   
   - `./gradlew :iceberg-arrow:test --tests 
"org.apache.iceberg.arrow.vectorized.TestNullabilityHolder"` on `origin/main` 
with the new test file imported → **3 tests, 3 failures** (regression visible).
   - Same command on this branch → **3 tests, 3 passed**.
   - `./gradlew :iceberg-arrow:spotlessCheck` → **PASS**.
   - Full `:iceberg-arrow:test` build wrapping the change → **BUILD 
SUCCESSFUL** (existing tests untouched).
   
   ## Edge cases tested
   
   | # | Scenario | Input | Expected | Verified by |
   |---|----------|-------|----------|-------------|
   | 1 | Single `setNull(idx)` then `reset()` | `setNull(3)`, `reset()` | 
`isNullAt(3) == 0`, all other indices == 0, `numNulls() == 0` | 
`resetClearsIsNullMarkers` |
   | 2 | Bulk `setNulls(start, num)` then `reset()` | `setNulls(2, 4)`, 
`reset()` | every index == 0, `hasNulls() == false` | `resetClearsBulkSetNulls` 
|
   | 3 | Idempotency — call `reset()` twice | `setNull(0)`, `reset()`, 
`reset()` | every index == 0, `numNulls() == 0` | `resetIsIdempotent` |
   
   ## Risk / blast radius
   
   Internal: `NullabilityHolder` is package-private under 
`org.apache.iceberg.arrow.vectorized` and only used by the vectorized arrow 
read path. The fix adds a `byte[]` zero-fill in `reset()` (size = batch 
capacity, typically 4–8 KiB on hot paths). On the dominant hot path the holder 
is reused across batches and the next bulk `setNulls`/`setNotNulls` would have 
overwritten the array anyway, so the only observable effect is a single extra 
`Arrays.fill` per `reset()` — a few microseconds per batch on typical JVM 
heuristics.
   
   Additive change; no API surface modified.
   
   ## Release note
   
   ```release-note
   Fix NullabilityHolder.reset() to clear the isNull marker array, restoring 
the documented clean-state contract for callers that reuse a holder across 
batches.
   ```
   
   ---
   
   *PR drafted with assistance from Claude Code. The change was reviewed 
manually against `apache/iceberg`'s source on `main`. The reproducer block 
above was used during development; it is the same one a reviewer can paste 
verbatim.*
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to