kevinjqliu commented on code in PR #1141: URL: https://github.com/apache/iceberg-python/pull/1141#discussion_r1759604254
########## pyiceberg/io/pyarrow.py: ########## @@ -1251,10 +1253,17 @@ def _task_to_record_batches( arrow_table = arrow_table.filter(pyarrow_filter) if len(arrow_table) == 0: continue - batch = arrow_table.to_batches()[0] - yield _to_requested_schema( - projected_schema, file_project_schema, batch, downcast_ns_timestamp_to_us=True, use_large_types=use_large_types - ) + output_batches = arrow_table.to_batches() + else: + output_batches = iter([batch]) + for output_batch in output_batches: Review Comment: I'm a bit concerned about all the nested for-loops in this function. But correctness comes first and we can always refactor later. ########## pyiceberg/io/pyarrow.py: ########## @@ -1251,10 +1253,17 @@ def _task_to_record_batches( arrow_table = arrow_table.filter(pyarrow_filter) if len(arrow_table) == 0: continue - batch = arrow_table.to_batches()[0] - yield _to_requested_schema( - projected_schema, file_project_schema, batch, downcast_ns_timestamp_to_us=True, use_large_types=use_large_types - ) + output_batches = arrow_table.to_batches() + else: + output_batches = iter([batch]) Review Comment: nit: redundant with `output_batches` already initialized to the same value ########## pyiceberg/io/pyarrow.py: ########## @@ -1238,10 +1238,12 @@ def _task_to_record_batches( for batch in batches: next_index = next_index + len(batch) current_index = next_index - len(batch) + output_batches = iter([batch]) if positional_deletes: # Create the mask of indices that we're interested in indices = _combine_positional_deletes(positional_deletes, current_index, current_index + len(batch)) batch = batch.take(indices) + # Apply the user filter if pyarrow_filter is not None: Review Comment: a little unrelated to this PR, but I wonder if there's ever a case where we need to filter (with `pyarrow_filter`) but `positional_deletes` is None/empty/falsey. I can see that as a potential issue -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org