rdblue commented on code in PR #6775:
URL: https://github.com/apache/iceberg/pull/6775#discussion_r1199482745
##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -721,18 +773,38 @@ def _file_to_table(
fragment_scanner = ds.Scanner.from_fragment(
fragment=fragment,
schema=physical_schema,
- filter=pyarrow_filter,
+ # This will push down the query to Arrow.
+ # But in case there are positional deletes, we have to apply them first
+ filter=pyarrow_filter if not positional_deletes else None,
columns=[col.name for col in file_project_schema.columns],
)
+ if positional_deletes:
+ # In the case of a mask, it is a bit awkward because we first
Review Comment:
If I understand correctly, the problem is that we are relying on the arrow
result to correspond 1-to-1 with the records in the file so that we can use
position in the DataFrame as the row position in the file.
This seems like a big problem. If we push the filter down and it skips even
one row, positions in the result no longer match row positions in the file, and
we lose the ability to apply the deletes correctly. But when there are deletes
to apply, we also don't want to read the entire file, which could mean reading
whole row groups unnecessarily.
I think we can solve this in a couple of ways. What we do in Java is project a
column of row positions that we carry through. If we eliminate a whole row
group, we start reading the next one with its starting position determined by
the number of rows in all previous groups. I don't know if Arrow supports this,
but it would need to.
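
To make the row-position idea concrete, here is a minimal sketch in plain Python. It does not use Arrow or the PyIceberg reader; row groups are modeled as lists of records, and the names `row_group_starts` and `read_with_positions` are hypothetical, chosen only for illustration. It shows how each surviving row keeps its original file position even when a whole row group is skipped.

```python
from itertools import accumulate


def row_group_starts(row_counts):
    """Return the file-level position of the first row in each row group."""
    # The start of group i is the total row count of all groups before it.
    return [0] + list(accumulate(row_counts))[:-1]


def read_with_positions(row_groups, selected_groups, deleted_positions):
    """Read only the selected row groups and drop position-deleted rows.

    row_groups: list of lists of records (a stand-in for a Parquet file).
    selected_groups: indexes of the row groups that survive pruning.
    deleted_positions: file-level row positions from the delete files.
    """
    starts = row_group_starts([len(group) for group in row_groups])
    deleted = set(deleted_positions)
    result = []
    for idx in selected_groups:
        for offset, record in enumerate(row_groups[idx]):
            position = starts[idx] + offset  # position in the whole file
            if position not in deleted:
                result.append((position, record))
    return result


row_groups = [["a", "b", "c"], ["d", "e"], ["f", "g", "h"]]
# Row group 1 is pruned; positions 1 and 5 are deleted.
rows = read_with_positions(row_groups, selected_groups=[0, 2], deleted_positions=[1, 5])
```

Because the position travels with each record, a later row-level filter can drop rows without disturbing the delete logic.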
Another option is to project just the filter columns for the entire file. If
the filter is `id > 5 and id < 10`, then project just the `id` column. Then run
the filter to produce a bitmap of selected rows, and `AND` that bitmap with the
position delete bitmap. That strategy can end up reading less data and being
faster, but it would also rely on support in Arrow for efficiently reading a
file with a bitmap selecting rows.
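
The bitmap strategy can be sketched the same way, again in plain Python rather than Arrow. The helper names are hypothetical, and the delete bitmap is assumed here to mark rows that are still live (not deleted), so that a plain `AND` gives the rows to read.

```python
def filter_mask(id_column):
    """Evaluate `id > 5 and id < 10` on the projected filter column only."""
    return [value is not None and 5 < value < 10 for value in id_column]


def live_mask(num_rows, deleted_positions):
    """Bitmap of rows that are NOT removed by position deletes (assumption)."""
    deleted = set(deleted_positions)
    return [position not in deleted for position in range(num_rows)]


def selected_positions(id_column, deleted_positions):
    """AND the filter bitmap with the live-row bitmap and return positions."""
    matches = filter_mask(id_column)
    live = live_mask(len(id_column), deleted_positions)
    return [pos for pos, (m, l) in enumerate(zip(matches, live)) if m and l]


ids = [3, 6, 7, 12, 9, 8]
# Positions 2 and 5 are deleted; positions 1, 2, 4, 5 match the filter.
positions = selected_positions(ids, deleted_positions=[2, 5])
```

Only the rows at the resulting positions would then need to be read for the remaining projected columns, which is the part that depends on reader support for selecting rows by bitmap.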
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]