kevinjqliu commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2282819711
Thanks for looking into the different scenarios. It looks like the results vary depending on the engine.

### Read Path

I took a deeper look into the read path for PyIceberg and discovered a missing optimization. For context, reading an Iceberg table goes through the internal TableScan API. For example, `catalog.load_table("foo.bar").scan().to_arrow()` produces a PyArrow table. Tracing the call stack, here are the main areas to look at:

* [`to_arrow`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/table/__init__.py#L2033-L2044) gathers the TableScan parameters and materializes a PyArrow table. It calls `project_table` with an iterable of `FileScanTask`s (`self.plan_files()`) and the specified limit, if any.
* [`project_table`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1361-L1386) takes the `FileScanTask`s, each corresponding to a parquet file, and processes them concurrently with futures. The futures stop early once the limit is satisfied.

So given an Iceberg table with 2 files, `scan(limit=1).to_arrow()` produces 2 `FileScanTask`s (one per parquet file). Both tasks are submitted simultaneously. After reading its entire parquet file, one of the tasks completes; the limit is then satisfied and the other future is cancelled.

The important note here is that the unit of work is a `FileScanTask`, i.e. an entire parquet file read. The limit is checked only after each future completes, which means at least 1 parquet file will be read entirely before the limit is applied; the limit is never pushed down to the parquet file read level. **Even when `limit=1` is specified, the entire parquet file will be read.** In your example, with or without a limit, an entire ~100MB data file will be read.
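To make the behavior concrete, here is a minimal, self-contained sketch of the per-file future pattern described above. All names (`read_whole_file`, `scan_with_limit`) are illustrative stand-ins, not PyIceberg's actual API; the point is only that the limit is checked after a whole "file" has already been materialized:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for one FileScanTask: the unit of work is the
# whole file, so the limit cannot cut a single read short.
def read_whole_file(path: str, num_rows: int) -> list:
    return [f"{path}:{i}" for i in range(num_rows)]

def scan_with_limit(files: dict, limit: int) -> list:
    rows = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(read_whole_file, p, n) for p, n in files.items()]
        for fut in as_completed(futures):
            rows.extend(fut.result())  # the entire file's rows arrive at once
            if len(rows) >= limit:     # limit checked only *after* completion
                for f in futures:
                    f.cancel()         # cancel remaining tasks, best effort
                break
    return rows[:limit]

result = scan_with_limit({"file-a.parquet": 100, "file-b.parquet": 100}, limit=1)
print(len(result))  # 1 row returned, but a whole 100-row "file" was read
```

Even though only 1 row is returned, `read_whole_file` materialized all 100 rows of at least one file first, mirroring the ~100MB full-file read in your example.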
The potential optimization here is to push the `limit` down to [`_task_to_table`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1262) and in turn to [`_task_to_record_batches`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1191), perhaps by checking the limit against the number of records [when yielding each batch](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1240-L1259). In fact, this optimization is already done in [`project_batches`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1457-L1475), which is used by [`to_arrow_batch_reader`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/table/__init__.py#L2046-L2063) (an alternative to `to_arrow`).

### Unify implementations

As a side note, there might be an opportunity to unify the implementation details of `to_arrow` and `to_arrow_batch_reader`.

### `to_arrow_batch_reader` bug

There is also a bug with `to_arrow_batch_reader`: it does not respect the given `limit` and returns more records than specified. The bug is in [`project_batches`](https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1457-L1475), specifically in the way `yield` interacts with the two for-loops. Here's a [Jupyter notebook](https://gist.github.com/kevinjqliu/1b5bc445ed12ca78e564db139c1e5683) reproducing the issue; see the last cell and compare the number of rows read by `to_arrow` vs `to_arrow_batch_reader`.

## Summary

* Optimize `to_arrow` to push down `limit` to the file level
* Potentially unify the implementations of `to_arrow` and `to_arrow_batch_reader`
* Fix `to_arrow_batch_reader` to respect the given `limit`

-- This is an automated message from the Apache Git Service.
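As a rough illustration of what the batch-level limit pushdown could look like, here is a hedged sketch. `read_batches` and `project_batches_with_limit` are hypothetical stand-ins for `_task_to_record_batches` and `project_batches`, using plain lists instead of Arrow record batches. It also shows why exiting via `return` matters when two loops are involved:

```python
from typing import Iterable, Iterator

# Hypothetical stand-in for _task_to_record_batches: yields a "file"
# in fixed-size batches instead of Arrow record batches.
def read_batches(task: list, batch_size: int = 3) -> Iterator[list]:
    for i in range(0, len(task), batch_size):
        yield task[i : i + batch_size]

def project_batches_with_limit(tasks: Iterable[list], limit: int) -> Iterator[list]:
    remaining = limit
    for task in tasks:
        for batch in read_batches(task):
            if len(batch) >= remaining:
                yield batch[:remaining]  # truncate the final batch at the limit
                return                   # return exits BOTH loops; a bare
                                         # `break` would only exit the inner one
                                         # and move on to the next task
            remaining -= len(batch)
            yield batch

tasks = [list(range(10)), list(range(10))]  # two "files" of 10 rows each
rows = [r for batch in project_batches_with_limit(tasks, limit=4) for r in batch]
print(len(rows))  # 4
```

With the limit tracked per batch, only two 3-row batches of the first "file" are ever produced, rather than both files in full. An inner-loop `break` in place of the `return` is exactly the kind of two-loop interaction that can let extra batches through.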
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org