xiaoxuandev opened a new pull request, #13451: URL: https://github.com/apache/iceberg/pull/13451
This PR implements limit pushdown optimization for Iceberg on Spark 3.5 and 4.0, enabling early termination during scan task planning to improve performance for `LIMIT` queries.

Resolves: #13383

### Notes

Since Spark's native limit pushdown has limitations when filters are present, this implementation:

1. Leverages Spark's native partial limit pushdown when available _(e.g., `SELECT * FROM table LIMIT n` or queries with partition pruning)_.
2. Implements Iceberg-level early termination during task group planning once the required number of records is reached.
3. Disables limit pushdown when `preserve-data-grouping` is enabled.

### Testing

- Unit Tests
- Performance Benchmarks

#### Benchmark Results

(These results are illustrative. Tables with a large number of data files generally take longer to execute when limit pushdown is disabled.)

#### 1 row per data file

| Query Type  | Limit | Push Down Enabled | Push Down Disabled | Improvement       |
|-------------|-------|-------------------|--------------------|-------------------|
| Limit Query | 100   | 0.093s            | 37.96s             | **99.75% faster** |
| Limit Query | 1000  | 0.484s            | 41.04s             | **98.82% faster** |
| Limit Query | 10000 | 7.023s            | 38.99s             | **81.99% faster** |

#### 5000 rows per data file

| Query Type  | Limit | Push Down Enabled | Push Down Disabled | Improvement      |
|-------------|-------|-------------------|--------------------|------------------|
| Limit Query | 100   | 0.0163s           | 0.0488s            | **66.5% faster** |
| Limit Query | 1000  | 0.0170s           | 0.0499s            | **66.0% faster** |
| Limit Query | 10000 | 0.0177s           | 0.0632s            | **71.9% faster** |

#### 20000 rows per data file

| Query Type  | Limit | Push Down Enabled | Push Down Disabled | Improvement      |
|-------------|-------|-------------------|--------------------|------------------|
| Limit Query | 100   | 0.0416s           | 0.0529s            | **21.4% faster** |
| Limit Query | 1000  | 0.0421s           | 0.0524s            | **19.7% faster** |
| Limit Query | 10000 | 0.0422s           | 0.0576s            | **26.7% faster** |
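The early termination during task planning described in the notes can be illustrated with a small sketch. This is not the code from the PR: `FileTask` and `planWithLimit` are hypothetical names, and the sketch assumes each task's record count is known from file metadata and that no filters or deletes remove rows afterwards (otherwise the count would only be an upper bound and stopping early would not be safe).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not Iceberg's or Spark's API.
class LimitPlanningSketch {

  // Hypothetical stand-in for a file scan task; only the record count matters here.
  static final class FileTask {
    final String path;
    final long recordCount;

    FileTask(String path, long recordCount) {
      this.path = path;
      this.recordCount = recordCount;
    }
  }

  // Collects tasks in order and stops once their combined record count covers the limit.
  // Assumption: recordCount is exact, so the planned tasks are guaranteed to yield
  // at least `limit` rows (or all rows, if the table holds fewer than `limit`).
  static List<FileTask> planWithLimit(Iterable<FileTask> tasks, long limit) {
    List<FileTask> planned = new ArrayList<>();
    long seen = 0;
    for (FileTask task : tasks) {
      if (seen >= limit) {
        break; // early termination: remaining files are never planned
      }
      planned.add(task);
      seen += task.recordCount;
    }
    return planned;
  }

  public static void main(String[] args) {
    List<FileTask> tasks = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      tasks.add(new FileTask("file-" + i + ".parquet", 40));
    }
    // With 40 rows per file, LIMIT 100 needs only the first 3 of the 5 files.
    System.out.println(planWithLimit(tasks, 100).size()); // prints 3
  }
}
```

Under these assumptions, the number of planned files shrinks as rows per file grow, which is consistent with the benchmark pattern above: the largest gains appear when each data file holds very few rows and the table has many files.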