kevinjqliu commented on issue #1032:
URL: 
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2282819711

   Thanks for looking into the different scenarios. It looks like there are 
varying results depending on the engines.
   
   ### Read Path
   I took a deeper look into the read path for PyIceberg and discovered a 
missing optimization. 
For context, reading an Iceberg table is done through the internal TableScan 
API. For example, `catalog.load_table("foo.bar").scan().to_arrow()` produces a 
PyArrow table. Tracing the call stack, here are the main areas to look at:
   * 
[`to_arrow`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/table/__init__.py#L2033-L2044),
 gathers the TableScan parameters and materializes a PyArrow Table. It calls 
`project_table` with an iterable of `FileScanTask` (`self.plan_files()`) and 
the specified limit, if any.
   * 
[`project_table`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1361-L1386),
 takes the FileScanTasks, each of which corresponds to a Parquet file, and 
processes them asynchronously with futures. Remaining futures are cancelled 
early once the limit is satisfied.
   
   So, given an Iceberg table with 2 files, `scan(limit=1).to_arrow()` will 
produce 2 FileScanTasks (one per Parquet file). Both tasks are submitted 
simultaneously. Once one of the tasks completes, after reading its entire 
Parquet file, the limit is satisfied and the other future is cancelled.
   
   The important note here is that the unit of work is the FileScanTask, i.e. 
one whole Parquet file read per task. The limit is checked only after each 
future completes, so at least one Parquet file is read in its entirety before 
the limit is applied; the limit is never pushed down to the Parquet read level. 
**Even when `limit=1` is specified, an entire Parquet file will be read**. In 
your example, with or without a limit, an entire ~100 MB data file will be 
read. 
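
   To make that concrete, here is a minimal stdlib-only simulation of the 
pattern (hypothetical names and data, not the actual PyIceberg code): one 
future per file, with the limit checked only after a whole task completes.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def read_file(name, num_rows):
    # Stands in for reading an entire Parquet file into memory.
    return [(name, i) for i in range(num_rows)]

def scan_with_limit(files, limit):
    rows = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(read_file, name, n) for name, n in files]
        pending = set(futures)
        while pending and len(rows) < limit:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                rows.extend(fut.result())  # the whole file was already read
        for fut in pending:
            fut.cancel()  # remaining files are skipped
    return rows[:limit]

# Even with limit=1, at least one full 100-row "file" is read before
# the limit is applied.
rows = scan_with_limit([("a.parquet", 100), ("b.parquet", 100)], limit=1)
```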
   
   The potential optimization here is to push down the `limit` to 
[`_task_to_table`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1262)
 and in turn to 
[`_task_to_record_batches`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1191).
 Perhaps check the limit against the number of records [when yielding each 
batch](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1240-L1259).
   In fact, this optimization is already implemented in 
[`project_batches`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1457-L1475),
 which is used by 
[`to_arrow_batch_reader`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/table/__init__.py#L2046-L2063)
 (an alternative to `to_arrow`). 
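
   As a sketch of what the pushdown could look like (a hypothetical generator, 
not the actual `_task_to_record_batches` signature), the batch generator can 
truncate the final batch and stop reading the file as soon as the limit is 
reached:

```python
def task_to_record_batches(batches, limit=None):
    # Hypothetical sketch: yield batches from one file, stopping early
    # once `limit` rows have been produced instead of reading the whole
    # file and discarding the excess afterwards.
    total = 0
    for batch in batches:
        if limit is not None and total + len(batch) >= limit:
            yield batch[: limit - total]  # truncate the final batch
            return                        # remaining batches never read
        total += len(batch)
        yield batch

# Three 5-row "batches", limit=7: only two batches are consumed.
batches = [list(range(5)), list(range(5)), list(range(5))]
out = list(task_to_record_batches(batches, limit=7))
```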
   
   ### Unify implementations
   As a side note, there might be an opportunity to unify the implementation 
details of `to_arrow` and `to_arrow_batch_reader`.
   
   ### `to_arrow_batch_reader` bug
   There is also a bug in `to_arrow_batch_reader`: it does not respect the 
given `limit` and returns more records than specified. The bug is in 
[`project_batches`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1457-L1475),
 specifically with the way yield interacts with the two for-loops. Here’s a 
[Jupyter 
notebook](https://gist.github.com/kevinjqliu/1b5bc445ed12ca78e564db139c1e5683) 
reproducing the issue, see the last cell and the number of rows read by using 
`to_arrow` vs `to_arrow_batch_reader`.
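
   The loop interaction can be reproduced with plain generators. A minimal 
sketch (hypothetical names, not the actual PyIceberg code) of the buggy 
pattern and one possible fix:

```python
def project_batches_buggy(files, limit):
    total = 0
    for batches in files:      # one entry per FileScanTask
        if total >= limit:     # limit is only checked per file...
            break
        for batch in batches:  # ...so this inner loop overshoots it
            total += len(batch)
            yield batch

def project_batches_fixed(files, limit):
    total = 0
    for batches in files:
        for batch in batches:
            if total + len(batch) >= limit:
                yield batch[: limit - total]  # truncate the final batch
                return
            total += len(batch)
            yield batch

# Two "files": the first holds two 4-row batches, the second one 4-row batch.
files = [[list(range(4)), list(range(4))], [list(range(4))]]
buggy = sum(len(b) for b in project_batches_buggy(files, limit=5))
fixed = sum(len(b) for b in project_batches_fixed(files, limit=5))
# buggy == 8 (overshoots the limit), fixed == 5
```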
   
   ## Summary
   * Optimize `to_arrow` to push down `limit` to the file level
   * Potentially unify implementation of `to_arrow` and `to_arrow_batch_reader`
   * Fix `to_arrow_batch_reader` to respect the given `limit`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

