tanmayrauth commented on issue #3498:
URL:
https://github.com/apache/iceberg-python/issues/3498#issuecomment-4701086673
I went through visitors.py and the call sites, and three of the four hold
up. The big one is #1: visit_not_equal and visit_not_in short-circuit on
_can_contain_nulls/_can_contain_nans and return ROWS_MUST_MATCH without ever
checking the bounds, where every other strict method (and Java) uses the
_contains_*_only variants. Since the strict evaluator drives _DeleteFiles
(snapshot.py:460), this isn't just over-pruning — a delete(NotEqualTo("x", 5))
against a file with stats [null, 5] would drop the whole file and take the
legitimate 5 row with it, so it's effectively silent data loss. #3 and #4 are
real too: the ResidualVisitor comparisons hit None < literal and raise
TypeError where the row evaluator safely guards for None, and
visit_not_nan(None) returns AlwaysFalse while the row evaluator treats it as
true — both reachable on the scan path via residual_for. The only one I'd push
back on is #2; the record_count <= 0 behavior matches upstream Java and looks
intentional, though
the comment could be clearer about the -1 case.
If nobody's already on this, I'd like to take it, starting with #1 since
it's the data-loss path, with a regression test asserting the 5 row survives
the delete, then a follow-up for the ResidualVisitor null handling.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]