GPX99 opened a new issue, #13383:
URL: https://github.com/apache/iceberg/issues/13383

   ### Feature Request / Improvement
   
   Request to add `limit` pushdown to improve the performance of reading a large table by avoiding a full batch scan; the batch scan is implemented [here](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java#L755-L762).
   
   **How is this observed?**
   When running `select * from table_name limit 1`, Spark actually scans all the data in the table; the bigger the table, the longer the query takes.
   
   For example, 
   ```
   (1) BatchScan glue_catalog.lakehouse_bronze.table_name
   Output [51]: [ISTEST#69, LEADUUID#70, UPDATEDAT#71, ...etc]
   glue_catalog.lakehouse_bronze.table_name (branch=null) [filters=, groupedBy=] <-- no limit pushdown
   ```
   Hence, the input size is large:
   <img width="687" alt="Image" src="https://github.com/user-attachments/assets/864d9349-6280-439f-8689-4a66541a6e4c" />
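   
   For context, Spark's DataSource V2 API (Spark 3.3+) already defines a `SupportsPushDownLimit` interface that a `ScanBuilder` can implement to receive the query's limit. A minimal sketch of the shape this could take, assuming a hypothetical `LimitAwareScanBuilder` class (not the actual Iceberg implementation):
   
   ```java
   import org.apache.spark.sql.connector.read.ScanBuilder;
   import org.apache.spark.sql.connector.read.SupportsPushDownLimit;
   
   // Minimal sketch, not the actual Iceberg code: a scan builder that
   // records the limit Spark pushes down so that build() can later use
   // it to stop planning scan tasks once enough rows are covered.
   abstract class LimitAwareScanBuilder implements ScanBuilder, SupportsPushDownLimit {
   
     // -1 means no limit was pushed down
     private int pushedLimit = -1;
   
     @Override
     public boolean pushLimit(int limit) {
       this.pushedLimit = limit;
       // Returning false keeps Spark's own limit operators in the plan,
       // so results stay correct even if the scan returns more rows
       // than requested.
       return false;
     }
   
     protected int pushedLimit() {
       return pushedLimit;
     }
   }
   ```
   
   With the limit recorded, `build()` could stop planning file scan tasks once their record counts cover the pushed limit; since Spark still applies its own limit operators on top, this would only reduce scan size, not change results.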
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [x] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time

