[GitHub] [iceberg] rdblue commented on a diff in pull request #6775: Python: Add positional deletes

via GitHub Fri, 19 May 2023 15:45:45 -0700


rdblue commented on code in PR #6775:
URL: https://github.com/apache/iceberg/pull/6775#discussion_r1199482745



##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -721,18 +773,38 @@ def _file_to_table(
         fragment_scanner = ds.Scanner.from_fragment(
             fragment=fragment,
             schema=physical_schema,
-            filter=pyarrow_filter,
+            # This will push down the query to Arrow.
+            # But in case there are positional deletes, we have to apply them first
+            filter=pyarrow_filter if not positional_deletes else None,
             columns=[col.name for col in file_project_schema.columns],
         )
 
+        if positional_deletes:
+            # In the case of a mask, it is a bit awkward because we first

Review Comment:
   If I understand correctly, the problem is that we rely on the Arrow result 
corresponding 1-to-1 with the records in the file, so that a row's position in 
the Arrow table can be used as its row position in the file.
   
   Seems like this is a big problem. If we push the filter down and it skips 
even one row, we lose the ability to apply the deletes correctly. But we also 
don't want to read the entire file whenever there are deletes to apply, since 
that could mean reading whole row groups unnecessarily.
   
   I think we can solve this in a couple of ways. What we do in Java is project a 
column of row positions that we carry through. If we eliminate a whole row 
group, we start reading the next one with its starting position determined by 
the number of rows in all previous groups. I don't know if Arrow supports this, 
but it would need to.
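   
   To illustrate the row-position idea, here is a minimal sketch in plain 
Python rather than PyArrow. Row groups are modelled as lists of `id` values, 
and every name in it (`row_group_offsets`, `read_with_positions`, `keep_group`) 
is hypothetical, not an Arrow or PyIceberg API. It only shows how positions 
stay correct when a whole row group is skipped.

```python
from itertools import accumulate


def row_group_offsets(row_counts):
    """Starting file position of each row group.

    Each offset is the number of rows in all previous groups.
    """
    return [0, *accumulate(row_counts)][:-1]


def read_with_positions(row_groups, keep_group, row_filter, deleted_positions):
    """Return (file_position, row) pairs that survive pruning, deletes and the filter.

    row_groups: list of lists, one inner list of rows per row group (assumed model)
    keep_group: callable(index) -> bool, stands in for row-group pruning
    row_filter: callable(row) -> bool, the pushed-down predicate
    deleted_positions: set of file-level row positions to drop
    """
    offsets = row_group_offsets([len(group) for group in row_groups])
    result = []
    for index, (offset, group) in enumerate(zip(offsets, row_groups)):
        if not keep_group(index):
            # The group is never read, but later offsets already account for it.
            continue
        for i, row in enumerate(group):
            position = offset + i
            if position in deleted_positions:
                continue
            if row_filter(row):
                result.append((position, row))
    return result


# Example: the first row group is pruned, positions 4 and 5 are deleted.
row_groups = [[1, 2, 3], [6, 7], [8, 9, 12]]
rows = read_with_positions(
    row_groups,
    keep_group=lambda index: index != 0,
    row_filter=lambda row_id: 5 < row_id < 10,
    deleted_positions={4, 5},
)
print(rows)  # [(3, 6), (6, 9)]
```

   Because each position is derived from the row group's starting offset, 
skipping the first group does not shift the positions of the rows that follow.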
   
   Another option is to project just the filter columns for the entire file. If 
the filter is `id > 5 and id < 10`, then project just the `id` column. Then run 
the filter to produce a bitmap of selected rows, and `AND` that bitmap with the 
position delete bitmap. That strategy can end up reading less data and being 
faster, but it would also rely on support in Arrow for efficiently reading a 
file with a bitmap selecting rows.
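
   The bitmap strategy can be sketched the same way. This is plain Python 
with lists of booleans standing in for bitmaps, and the helper names 
(`selection_bitmap`, `live_bitmap`, `select_rows`) are made up for the 
example. Whether Arrow can read a file efficiently from such a bitmap is 
exactly the open question, so the last step here simply filters in-memory 
columns.

```python
def selection_bitmap(filter_column, predicate):
    """One boolean per row: True where the filter predicate matches."""
    return [predicate(value) for value in filter_column]


def live_bitmap(num_rows, deleted_positions):
    """One boolean per row: True where the row is NOT position-deleted."""
    deleted = set(deleted_positions)
    return [position not in deleted for position in range(num_rows)]


def select_rows(columns, bitmap):
    """Keep only the rows whose bitmap entry is True, for every column."""
    return {
        name: [value for value, keep in zip(values, bitmap) if keep]
        for name, values in columns.items()
    }


# A toy "file" with two columns; only `id` is needed to evaluate the filter.
file_columns = {
    "id": [3, 6, 7, 8, 12],
    "name": ["a", "b", "c", "d", "e"],
}

# Step 1: project just the filter column and evaluate `id > 5 and id < 10`.
matches = selection_bitmap(file_columns["id"], lambda row_id: 5 < row_id < 10)

# Step 2: build the bitmap of rows that survive the position deletes.
alive = live_bitmap(len(file_columns["id"]), deleted_positions=[2])

# Step 3: AND the two bitmaps, then read the remaining columns with the result.
selected = [m and a for m, a in zip(matches, alive)]
table = select_rows(file_columns, selected)
print(table)  # {'id': [6, 8], 'name': ['b', 'd']}
```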



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


