Re: [PR] Spark partial limit push down [iceberg]

via GitHub Fri, 27 Sep 2024 17:44:42 -0700


huaxingao commented on code in PR #10943:
URL: https://github.com/apache/iceberg/pull/10943#discussion_r1779324357



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java:
##########
@@ -78,7 +78,19 @@ abstract class SparkPartitioningAwareScan<T extends 
PartitionScanTask> extends S
       Schema expectedSchema,
       List<Expression> filters,
       Supplier<ScanReport> scanReportSupplier) {
-    super(spark, table, readConf, expectedSchema, filters, scanReportSupplier);
+    this(spark, table, scan, readConf, expectedSchema, filters, 
scanReportSupplier, null);
+  }
+
+  SparkPartitioningAwareScan(
+      SparkSession spark,
+      Table table,
+      Scan<?, ? extends ScanTask, ? extends ScanTaskGroup<?>> scan,
+      SparkReadConf readConf,
+      Schema expectedSchema,
+      List<Expression> filters,
+      Supplier<ScanReport> scanReportSupplier,
+      Integer pushedLimit) {

Review Comment:
   Thanks @RussellSpitzer for sharing your concern and suggestion!
   I think we can borrow @aokolnychyi 's idea of adding [`ParquetBatchReadConf` 
and 
`OrcBatchReadConf`](https://github.com/apache/iceberg/pull/9841#discussion_r1594829408).
 We can have something like 
   ```
   @Value.Immutable
   public interface ParquetBatchReadConf extends Serializable {
     ParquetReaderType readerType();  // this is for comet, we don't need this 
for now
     int batchSize();
     @Nullable
     Integer limit();
   }
   
   @Value.Immutable
   public interface OrcBatchReadConf extends Serializable {
     int batchSize();
   }
   ```
   Similarly, we can also have `ParquetRowReadConf` and `OrcRowReadConf`.
   
   I have changed the code to add `ParquetBatchReadConf` and 
`OrcBatchReadConf`. We still need to pass `pushedLimit` to the 
`SparkPartitioningAwareScan`, `SparkScan` and `SparkBatch` constructors, so 
`pushedLimit` can be passed from `SparkScanBuilder` to `SparkBatch`, this is 
because `pushedLimit` is not available in `SparkReadConf`, we have to call 
`SparkScanBuilder`.`pushLimit` to get `pushedLimit`.
   
   Please let me know if this approach looks OK to you. Thanks!
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark partial limit push down [iceberg]

Reply via email to