wombatu-kun opened a new pull request, #16742: URL: https://github.com/apache/iceberg/pull/16742
## Problem

`DeleteFilter.applyPosDeletes` short-circuits and returns the records untouched when there are no position deletes, but `DeleteFilter.applyEqDeletes` had no equivalent guard. When a scan has no equality deletes it still built an always-false predicate and wrapped every row in a `FilterIterator` (or, in CDC `_is_deleted` mode, a per-row `transform`) that evaluated that predicate for no effect.

This no-op wrapper sits on the hot path of the most common reads: plain scans (no deletes) and position-delete / deletion-vector scans (position deletes present, equality deletes empty). A deletion-vector merge-on-read scan chained two `FilterIterator`s where one suffices.

`DeleteFilter.filter` is called unconditionally by the generic reader, Spark `RowDataReader`, Flink `RowDataFileScanTaskReader`, and the MR input format, so the fix benefits every engine.

## Change

Add an `eqDeletes.isEmpty()` guard to `applyEqDeletes`, mirroring the existing `applyPosDeletes` guard. When there are no equality deletes the records are returned unchanged.

Behavior is identical: in filter mode the always-false predicate kept every row, and in CDC mode the equality `markDeleted` marked nothing, so only the redundant iterator is removed.

## Benchmark

Adds `DeleteFilterBenchmark`; the data module previously had no delete-path benchmark. 2.5M rows, 5% delete ratio, single-shot ms/op, lower is better. `filterOnly` applies the filter to pre-materialized records to isolate it from Parquet decode; `scan` is end to end. EQUALITY is an unchanged control.

| view | scenario | before (ms/op) | after (ms/op) | delta |
| --- | --- | --- | --- | --- |
| filterOnly | NONE | 57.8 | 15.6 | -73% |
| filterOnly | POSITION | 184.6 | 129.7 | -30% |
| filterOnly | EQUALITY (control) | 163.4 | 164.8 | +0.8% |
| scan | NONE | 664 | 611 | -7.9% |
| scan | POSITION | 869 | 809 | -6.9% |
| scan | EQUALITY (control) | 801 | 870 | within noise |

The isolated cost of a no-op `FilterIterator` is roughly 17-22 ns/row of state-machine indirection; end to end this is diluted by decode to about 7% on plain and position-delete scans. The equality-delete path is unchanged, since the new early return is never taken when equality deletes are present.

Verified with the existing generic delete read tests (`TestGenericReaderDeletes`, 44 cases across Parquet/Avro/ORC and format versions 2 and 3, including the position-only, mixed, and `_is_deleted` CDC paths) and `TestLocalScan`.
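For readers who want the shape of the guard from the Change section without opening the diff, here is a minimal standalone sketch. It is not the Iceberg implementation: the class `EqDeleteGuardSketch`, the `filter` helper, and the use of a list of predicates to stand in for equality delete sets are all assumptions made for illustration. It only shows the idea that an empty delete list returns the input iterable itself, so no per-row wrapper is created.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical stand-in for the guard; names and signatures are illustrative only.
public class EqDeleteGuardSketch {

  // Lazily filters rows; plays the role of the per-row wrapper the guard avoids.
  public static <T> Iterable<T> filter(Iterable<T> rows, Predicate<T> keep) {
    return () -> new Iterator<T>() {
      private final Iterator<T> inner = rows.iterator();
      private T next;
      private boolean ready;

      @Override
      public boolean hasNext() {
        // Advance until a row passes the predicate or the input is exhausted.
        while (!ready && inner.hasNext()) {
          T candidate = inner.next();
          if (keep.test(candidate)) {
            next = candidate;
            ready = true;
          }
        }
        return ready;
      }

      @Override
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        ready = false;
        return next;
      }
    };
  }

  // Each predicate answers "is this row deleted?" (an assumption of this sketch).
  public static <T> Iterable<T> applyEqDeletes(Iterable<T> rows, List<Predicate<T>> isDeleted) {
    if (isDeleted.isEmpty()) {
      return rows; // the guard: nothing to filter, so skip the wrapper entirely
    }
    Predicate<T> deleted = row -> isDeleted.stream().anyMatch(p -> p.test(row));
    return filter(rows, deleted.negate());
  }

  public static void main(String[] args) {
    List<Integer> rows = List.of(1, 2, 3, 4);

    // No equality deletes: the very same iterable comes back.
    List<Predicate<Integer>> none = List.of();
    System.out.println(applyEqDeletes(rows, none) == rows); // prints "true"

    // One equality delete (even values): rows are filtered lazily.
    Predicate<Integer> isEven = r -> r % 2 == 0;
    List<Integer> kept = new ArrayList<>();
    applyEqDeletes(rows, List.of(isEven)).forEach(kept::add);
    System.out.println(kept); // prints "[1, 3]"
  }
}
```

Returning the input instance rather than an identity wrapper is the point of the sketch: the benchmark's `filterOnly` / NONE row measures exactly the cost of that avoided wrapper.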
