gabeiglio commented on code in PR #1621: URL: https://github.com/apache/iceberg-python/pull/1621#discussion_r1948224048
########## pyiceberg/io/pyarrow.py: ########## @@ -1359,14 +1359,11 @@ def _task_to_record_batches( # Apply the user filter if pyarrow_filter is not None: - # we need to switch back and forth between RecordBatch and Table - # as Expression filter isn't yet supported in RecordBatch - # https://github.com/apache/arrow/issues/39220 - arrow_table = pa.Table.from_batches([batch]) - arrow_table = arrow_table.filter(pyarrow_filter) - if len(arrow_table) == 0: + if batch.num_rows == 0: continue - output_batches = arrow_table.to_batches() + filtered_batch = batch.filter(pyarrow_filter) + if filtered_batch.num_rows > 0: + output_batches = [filtered_batch] Review Comment: I like this approach, the only thing is that I think there is a bug in pyarrow filter function that fails when trying to filter an empty batch [_filter_table](https://github.com/apache/arrow/blob/e32f56b478171fc4b53dc2042c4cf5d37c97e351/python/pyarrow/acero.py#L377) So with this approach, If there are no positional deletes but we do have a pyarrow_filter we should also check first for empty batches before filtering, doing something like this: ```python # Apply user filter if needed if pyarrow_filter is not None: # filter fails if we pass an empty batch if current_batch.num_rows > 0: current_batch = current_batch.filter(pyarrow_filter) # Skip empty batches if current_batch.num_rows == 0: continue ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org