Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

via GitHub Tue, 13 Aug 2024 17:24:38 -0700


kevinjqliu commented on code in PR #1057:
URL: https://github.com/apache/iceberg-python/pull/1057#discussion_r1716133239



##########
pyiceberg/io/pyarrow.py:
##########
@@ -1194,6 +1194,7 @@ def _task_to_record_batches(
     case_sensitive: bool,
     name_mapping: Optional[NameMapping] = None,
     use_large_types: bool = True,
+    limit: Optional[int] = None,

Review Comment:
   I like the #1043 way where `_task_to_record_batches` is kept as is; its just 
generating more batches. The function which calls `_task_to_record_batches` can 
keep track of the `limit`



##########
pyiceberg/io/pyarrow.py:
##########
@@ -1366,6 +1373,7 @@ def project_table(
             case_sensitive,
             table_metadata.name_mapping(),
             use_large_types,
+            limit,

Review Comment:
   I think this change is enough to "pushdown the limit". `_task_to_table` will 
get batches from `_task_to_record_batches` and check the limit, returning a 
pyarrow Table when the limit is met. 
   
   If the limit is not met, `_task_to_table` will just return all the batches 
as a pyarrow Table



##########
pyiceberg/io/pyarrow.py:
##########
@@ -1265,6 +1270,7 @@ def _task_to_table(
     case_sensitive: bool,
     name_mapping: Optional[NameMapping] = None,
     use_large_types: bool = True,
+    limit: Optional[int] = None,
 ) -> Optional[pa.Table]:
     batches = list(

Review Comment:
   similar to the comment above, i think we can process the `limit` here 
instead of passing it down to `_task_to_record_batches`



##########
pyiceberg/io/pyarrow.py:
##########
@@ -1194,6 +1194,7 @@ def _task_to_record_batches(
     case_sensitive: bool,
     name_mapping: Optional[NameMapping] = None,
     use_large_types: bool = True,
+    limit: Optional[int] = None,

Review Comment:
   "technically" we can also push down the limit to the 
`ds.Scanner.from_fragment`'s `batch_size` parameter, but that's currently out 
of scope 
   
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

Reply via email to