gabeiglio commented on code in PR #1621:
URL: https://github.com/apache/iceberg-python/pull/1621#discussion_r1948224048


##########
pyiceberg/io/pyarrow.py:
##########
@@ -1359,14 +1359,11 @@ def _task_to_record_batches(
 
                 # Apply the user filter
                 if pyarrow_filter is not None:
-                    # we need to switch back and forth between RecordBatch and 
Table
-                    # as Expression filter isn't yet supported in RecordBatch
-                    # https://github.com/apache/arrow/issues/39220
-                    arrow_table = pa.Table.from_batches([batch])
-                    arrow_table = arrow_table.filter(pyarrow_filter)
-                    if len(arrow_table) == 0:
+                    if batch.num_rows == 0:
                         continue
-                    output_batches = arrow_table.to_batches()
+                    filtered_batch = batch.filter(pyarrow_filter)
+                    if filtered_batch.num_rows > 0:
+                        output_batches = [filtered_batch]

Review Comment:
   I like this approach, the only thing is that I think there is a bug in 
pyarrow filter function that fails when trying to filter an empty batch 
[_filter_table](https://github.com/apache/arrow/blob/e32f56b478171fc4b53dc2042c4cf5d37c97e351/python/pyarrow/acero.py#L377)
   
   So with this approach, If there are no positional deletes but we do have a 
pyarrow_filter we should also check first for empty batches before filtering, 
doing something like this:
   ```python
   # Apply user filter if needed
   if pyarrow_filter is not None:
       # filter fails if we pass an empty batch
       if current_batch.num_rows > 0:
           current_batch = current_batch.filter(pyarrow_filter)
       # Skip empty batches
       if current_batch.num_rows == 0:
           continue
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to