wombatu-kun opened a new pull request, #16742:
URL: https://github.com/apache/iceberg/pull/16742

   ## Problem
   
   `DeleteFilter.applyPosDeletes` short-circuits and returns the records 
untouched when there are no position deletes, but `DeleteFilter.applyEqDeletes` 
had no equivalent guard. When a scan had no equality deletes, it still built an 
always-false predicate and wrapped every row in a `FilterIterator` (or, in CDC 
`_is_deleted` mode, a per-row `transform`) that evaluated that predicate to no 
effect.
   
   This no-op wrapper sits on the hot path of the most common reads: plain 
scans (no deletes) and position-delete / deletion-vector scans (position 
deletes present, equality deletes empty). A deletion-vector merge-on-read scan 
chained two `FilterIterator`s where one suffices. `DeleteFilter.filter` is 
called unconditionally by the generic reader, Spark `RowDataReader`, Flink 
`RowDataFileScanTaskReader`, and the MR input format, so the fix benefits every 
engine.
   
   ## Change
   
   Add an `eqDeletes.isEmpty()` guard to `applyEqDeletes`, mirroring the 
existing `applyPosDeletes` guard. When there are no equality deletes the 
records are returned unchanged. Behavior is identical: in filter mode the 
always-false predicate kept every row, and in CDC mode the equality 
`markDeleted` marked nothing, so only the redundant iterator is removed.
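
The guard can be illustrated with a minimal, self-contained sketch. This is not the Iceberg source: `EqDeleteGuardSketch`, `filter`, `applyEqDeletesUnguarded`, `applyEqDeletesGuarded`, and `toList` are hypothetical names, and a plain lazy iterator stands in for `FilterIterator`. It only models the idea that an empty set of equality deletes yields an always-false "is deleted" predicate, so skipping the wrapper returns the same rows.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Simplified, hypothetical model of the guard; not the Iceberg implementation.
public class EqDeleteGuardSketch {

  // Stand-in for a filtering iterator: lazily advances and keeps rows that pass the predicate.
  static <T> Iterable<T> filter(Iterable<T> rows, Predicate<T> keep) {
    return () -> new Iterator<T>() {
      private final Iterator<T> inner = rows.iterator();
      private T pending;
      private boolean ready = false;

      @Override
      public boolean hasNext() {
        while (!ready && inner.hasNext()) {
          T candidate = inner.next();
          if (keep.test(candidate)) {
            pending = candidate;
            ready = true;
          }
        }
        return ready;
      }

      @Override
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        ready = false;
        return pending;
      }
    };
  }

  // Before: with no delete predicates, "is deleted" is always false,
  // yet every row is still wrapped and tested.
  static <T> Iterable<T> applyEqDeletesUnguarded(Iterable<T> rows, List<Predicate<T>> isDeleted) {
    Predicate<T> deleted = isDeleted.stream().reduce(row -> false, Predicate::or);
    return filter(rows, deleted.negate());
  }

  // After: return the input as-is when there is nothing to apply.
  static <T> Iterable<T> applyEqDeletesGuarded(Iterable<T> rows, List<Predicate<T>> isDeleted) {
    if (isDeleted.isEmpty()) {
      return rows;
    }
    return applyEqDeletesUnguarded(rows, isDeleted);
  }

  // Helper to materialize an Iterable for comparison.
  static <T> List<T> toList(Iterable<T> rows) {
    List<T> out = new ArrayList<>();
    rows.forEach(out::add);
    return out;
  }
}
```

In this sketch both variants produce the same rows for an empty delete list; the guarded one simply hands back the original `Iterable` instead of a wrapper around it.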
   
   ## Benchmark
   
   Adds `DeleteFilterBenchmark`; the data module previously had no delete-path 
benchmark. 2.5M rows, 5% delete ratio, single-shot ms/op, lower is better. 
`filterOnly` applies the filter to pre-materialized records to isolate it from 
Parquet decode; `scan` is end to end. EQUALITY is an unchanged control.
   
   | view | scenario | before (ms/op) | after (ms/op) | delta |
   | --- | --- | --- | --- | --- |
   | filterOnly | NONE | 57.8 | 15.6 | -73% |
   | filterOnly | POSITION | 184.6 | 129.7 | -30% |
   | filterOnly | EQUALITY (control) | 163.4 | 164.8 | +0.9% |
   | scan | NONE | 664 | 611 | -8.0% |
   | scan | POSITION | 869 | 809 | -6.9% |
   | scan | EQUALITY (control) | 801 | 870 | +8.6% (within noise) |
   
   The isolated cost of a no-op `FilterIterator` is roughly 17-22 ns/row of 
state-machine indirection; end to end this is diluted by decode to about 7-8% on 
plain and position-delete scans. The equality-delete path is unchanged, since 
the new guard's early return is never taken when equality deletes are present.
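
The per-row figure can be reproduced from the `filterOnly` rows of the table. The sketch below assumes all 2.5M rows pass through the removed wrapper in both scenarios; in the POSITION case the rows reaching it may be about 5% fewer, which would raise that estimate slightly. `PerRowOverhead` and `nsPerRow` are hypothetical names used only for this calculation.

```java
public class PerRowOverhead {

  // Saved milliseconds spread over the rows iterated, expressed in nanoseconds per row.
  static double nsPerRow(double beforeMs, double afterMs, long rows) {
    return (beforeMs - afterMs) * 1_000_000.0 / rows;
  }

  public static void main(String[] args) {
    long rows = 2_500_000L; // row count stated in the benchmark description

    // filterOnly / NONE: 57.8 ms before, 15.6 ms after
    System.out.println("NONE ns/row: " + nsPerRow(57.8, 15.6, rows));

    // filterOnly / POSITION: 184.6 ms before, 129.7 ms after
    System.out.println("POSITION ns/row: " + nsPerRow(184.6, 129.7, rows));
  }
}
```

Under that assumption the two scenarios come out at roughly 16.9 and 22.0 ns/row, which matches the 17-22 ns/row range quoted above.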
   
   Verified with the existing generic delete read tests 
(`TestGenericReaderDeletes`, 44 cases across Parquet/Avro/ORC and format 
versions 2 and 3, including the position-only, mixed, and `_is_deleted` CDC 
paths) and `TestLocalScan`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

