kevinjqliu commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2282819711
Thanks for looking into the different scenarios. It looks like the results vary depending on the engine.

### Read Path

I took a deeper look into the read path for PyIceberg and discovered a missing optimization. For context, reading an Iceberg table goes through the internal TableScan API. For example, `catalog.load_table("foo.bar").scan().to_arrow()` produces a PyArrow table. Tracing the call stack, here are the main areas to look at:

* [`to_arrow`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/table/__init__.py#L2033-L2044) gathers the TableScan parameters and materializes a PyArrow table. It calls `project_table` with an iterable of `FileScanTask`s (`self.plan_files()`) and the specified limit, if any.
* [`project_table`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1361-L1386) takes the `FileScanTask`s, each corresponding to a parquet file, and processes them concurrently with futures. The futures stop early once the limit is satisfied.

So given an Iceberg table with 2 files, `scan(limit=1).to_arrow()` produces 2 `FileScanTask`s (one per parquet file). Both tasks are submitted simultaneously. After reading its entire parquet file, one of the tasks completes; the limit is then satisfied and the other future is cancelled.

The important note here is that the unit of work is a `FileScanTask`, i.e. an entire parquet file read. The limit is checked only after each future completes, which means at least 1 parquet file will be read entirely before the limit is applied; the limit is never pushed down to the parquet file read level. **Even when `limit=1` is specified, the entire parquet file will be read.** In your example, with or without a limit, an entire ~100MB data file will be read.
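To make the behavior concrete, here is a minimal, self-contained sketch of the per-file future pattern described above. All names (`read_whole_file`, `scan_with_limit`) are illustrative stand-ins, not PyIceberg's actual API; the point is only that the limit is checked after a whole "file" has already been materialized:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for one FileScanTask: the unit of work is the
# whole file, so the limit cannot cut a single read short.
def read_whole_file(path: str, num_rows: int) -> list:
    return [f"{path}:{i}" for i in range(num_rows)]

def scan_with_limit(files: dict, limit: int) -> list:
    rows = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(read_whole_file, p, n) for p, n in files.items()]
        for fut in as_completed(futures):
            rows.extend(fut.result())  # the entire file's rows arrive at once
            if len(rows) >= limit:     # limit checked only *after* completion
                for f in futures:
                    f.cancel()         # cancel remaining tasks, best effort
                break
    return rows[:limit]

result = scan_with_limit({"file-a.parquet": 100, "file-b.parquet": 100}, limit=1)
print(len(result))  # 1 row returned, but a whole 100-row "file" was read
```

Even though only 1 row is returned, `read_whole_file` materialized all 100 rows of at least one file first, mirroring the ~100MB full-file read in your example.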
The potential optimization here is to push the `limit` down to [`_task_to_table`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1262) and in turn to [`_task_to_record_batches`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1191), perhaps by checking the limit against the number of records [when yielding each batch](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1240-L1259). In fact, this optimization is already done in [`project_batches`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1457-L1475), which is used by [`to_arrow_batch_reader`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/table/__init__.py#L2046-L2063) (an alternative to `to_arrow`).

### Unify implementations

As a side note, there might be an opportunity to unify the implementation details of `to_arrow` and `to_arrow_batch_reader`.

### `to_arrow_batch_reader` bug

There is also a bug with `to_arrow_batch_reader`: it does not respect the given `limit` and returns more records than specified. The bug is in [`project_batches`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1457-L1475), specifically in the way `yield` interacts with the two for-loops. Here's a [Jupyter notebook](https://gist.github.com/kevinjqliu/1b5bc445ed12ca78e564db139c1e5683) reproducing the issue; see the last cell and compare the number of rows read by `to_arrow` vs `to_arrow_batch_reader`.

## Summary

* Optimize `to_arrow` to push down `limit` to the file level
* Potentially unify the implementations of `to_arrow` and `to_arrow_batch_reader`
* Fix `to_arrow_batch_reader` to respect the given `limit`

-- This is an automated message from the Apache Git Service.
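As a rough illustration of what the batch-level limit pushdown could look like, here is a hedged sketch. `read_batches` and `project_batches_with_limit` are hypothetical stand-ins for `_task_to_record_batches` and `project_batches`, using plain lists instead of Arrow record batches. It also shows why exiting via `return` matters when two loops are involved:

```python
from typing import Iterable, Iterator

# Hypothetical stand-in for _task_to_record_batches: yields a "file"
# in fixed-size batches instead of Arrow record batches.
def read_batches(task: list, batch_size: int = 3) -> Iterator[list]:
    for i in range(0, len(task), batch_size):
        yield task[i : i + batch_size]

def project_batches_with_limit(tasks: Iterable[list], limit: int) -> Iterator[list]:
    remaining = limit
    for task in tasks:
        for batch in read_batches(task):
            if len(batch) >= remaining:
                yield batch[:remaining]  # truncate the final batch at the limit
                return                   # return exits BOTH loops; a bare
                                         # `break` would only exit the inner one
                                         # and move on to the next task
            remaining -= len(batch)
            yield batch

tasks = [list(range(10)), list(range(10))]  # two "files" of 10 rows each
rows = [r for batch in project_batches_with_limit(tasks, limit=4) for r in batch]
print(len(rows))  # 4
```

With the limit tracked per batch, only two 3-row batches of the first "file" are ever produced, rather than both files in full. An inner-loop `break` in place of the `return` is exactly the kind of two-loop interaction that can let extra batches through.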
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org