corleyma commented on code in PR #786:
URL: https://github.com/apache/iceberg-python/pull/786#discussion_r1637386352
##########
pyiceberg/io/pyarrow.py:
##########
@@ -1005,36 +1004,46 @@ def _task_to_table(
         columns=[col.name for col in file_project_schema.columns],
     )

-    if positional_deletes:
-        # Create the mask of indices that we're interested in
-        indices = _combine_positional_deletes(positional_deletes, fragment.count_rows())
-
-        if limit:
-            if pyarrow_filter is not None:
-                # In case of the filter, we don't exactly know how many rows
-                # we need to fetch upfront, can be optimized in the future:
-                # https://github.com/apache/arrow/issues/35301
-                arrow_table = fragment_scanner.take(indices)
-                arrow_table = arrow_table.filter(pyarrow_filter)
-                arrow_table = arrow_table.slice(0, limit)
-            else:
-                arrow_table = fragment_scanner.take(indices[0:limit])
-        else:
-            arrow_table = fragment_scanner.take(indices)
+    current_index = 0
+    batches = fragment_scanner.to_batches()

Review Comment:
   As I review this, it occurs to me that it might be useful to expose the batching/read-ahead options that pyarrow accepts when constructing the scanner. See the [pyarrow docs](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment) for more details. In particular, I think `batch_size` ought to be tunable, since memory pressure will be a function of the batch size and the number and types of columns in the table.
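   To make the suggestion concrete, here is a minimal sketch of what exposing those knobs could look like. The helper `_make_fragment_scanner` and its parameter names are hypothetical, standing in for however the PR actually builds `fragment_scanner`; the `batch_size` and `batch_readahead` keyword arguments are real `Scanner.from_fragment` options per the docs linked above.

   ```python
   import pyarrow.dataset as ds


   def _make_fragment_scanner(
       fragment: ds.Fragment,
       schema,                     # pyarrow.Schema of the file being read
       columns,                    # list[str]: projected column names
       row_filter=None,            # optional pyarrow.compute.Expression
       batch_size: int = 131_072,  # pyarrow's default; the knob worth exposing
       batch_readahead: int = 16,  # pyarrow's default number of prefetched batches
   ) -> ds.Scanner:
       # Peak memory is roughly batch_size * (number of in-flight batches) *
       # per-row width of the projected columns, so wide or binary-heavy
       # schemas benefit from a smaller batch_size.
       return ds.Scanner.from_fragment(
           fragment=fragment,
           schema=schema,
           columns=columns,
           filter=row_filter,
           batch_size=batch_size,
           batch_readahead=batch_readahead,
       )
   ```

   Plumbing these through as read options (rather than hard-coding pyarrow's defaults) would let users cap peak memory without touching the scan logic itself.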