Re: [PR] Fix: Handle Empty RecordBatch within `_task_to_record_batches` [iceberg-python]

via GitHub Thu, 08 Aug 2024 14:18:15 -0700


sungwy commented on code in PR #1026:
URL: https://github.com/apache/iceberg-python/pull/1026#discussion_r1710310025



##########
pyiceberg/io/pyarrow.py:
##########
@@ -1249,11 +1251,12 @@ def _task_to_record_batches(
                     # https://github.com/apache/arrow/issues/39220
                     arrow_table = pa.Table.from_batches([batch])
                     arrow_table = arrow_table.filter(pyarrow_filter)
+                    if len(arrow_table) == 0:
+                        continue
                     batch = arrow_table.to_batches()[0]
             yield _to_requested_schema(
                 projected_schema, file_project_schema, batch, downcast_ns_timestamp_to_us=True, use_large_types=use_large_types
             )
-            current_index += len(batch)

Review Comment:
   While fixing https://github.com/apache/iceberg-python/issues/1024, I 
realized that a correctness issue was introduced here: `current_index` is 
advanced by the length of the filtered batch instead of the length of the 
original batch. I think it is crucial to ship this fix in 0.7.1 as soon as 
possible to support our merge-on-read (MOR) users.
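
To illustrate the drift described above, here is a minimal sketch that uses plain Python lists in place of Arrow record batches. It is not PyIceberg's implementation: `scan`, `deleted_positions`, `keep`, and `advance_by_filtered_length` are hypothetical names chosen for this example, and the only assumption is that positional deletes refer to row positions within the whole file.

```python
from typing import Callable, List, Set


def scan(
    batches: List[List[int]],
    deleted_positions: Set[int],
    keep: Callable[[int], bool],
    advance_by_filtered_length: bool,
) -> List[int]:
    """Hypothetical stand-in for a batch reader; not PyIceberg's code."""
    current_index = 0  # file-level position of the first row in the batch
    result: List[int] = []
    for batch in batches:
        original_len = len(batch)
        # Positional deletes refer to row positions within the whole file.
        rows = [
            row
            for offset, row in enumerate(batch)
            if current_index + offset not in deleted_positions
        ]
        # The row filter then shrinks the batch further.
        rows = [row for row in rows if keep(row)]
        result.extend(rows)
        # Advancing by len(rows) makes the positions of later batches drift.
        current_index += len(rows) if advance_by_filtered_length else original_len
    return result


# Each value equals its own file position, so a wrong delete is easy to spot.
batches = [[0, 1, 2], [3, 4, 5]]
deleted = {4}
not_zero = lambda value: value != 0

print(scan(batches, deleted, not_zero, advance_by_filtered_length=False))
print(scan(batches, deleted, not_zero, advance_by_filtered_length=True))
```

With the original batch length, the row at position 4 is removed as intended. With the filtered length, the second batch is treated as starting at position 2, so the row at position 5 is removed instead and row 4 survives.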



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
