singhpk234 commented on code in PR #10943:
URL: https://github.com/apache/iceberg/pull/10943#discussion_r1720037386


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##########
@@ -45,6 +45,10 @@ private SparkSQLProperties() {}
       "spark.sql.iceberg.aggregate-push-down.enabled";
   public static final boolean AGGREGATE_PUSH_DOWN_ENABLED_DEFAULT = true;
 
+  // Controls whether to push down limit to Iceberg
+  public static final String LIMIT_PUSH_DOWN_ENABLED = "spark.sql.iceberg.limit-push-down.enabled";
+  public static final boolean LIMIT_PUSH_DOWN_ENABLED_DEFAULT = true;

Review Comment:
   [Q] Will this config be honoured when we use Spark with ORC / Avro? It looks like it will not; should we namespace this to Parquet then?
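   As a minimal sketch of the namespacing suggestion above (the constant name and property key are illustrative assumptions, not from this PR), a Parquet-scoped variant could look like:

```java
// Hypothetical Parquet-scoped property, mirroring the existing push-down constants.
// The constant name and key below are assumptions for illustration only.
public static final String PARQUET_LIMIT_PUSH_DOWN_ENABLED =
    "spark.sql.iceberg.parquet.limit-push-down.enabled";
public static final boolean PARQUET_LIMIT_PUSH_DOWN_ENABLED_DEFAULT = true;
```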



##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedColumnIterator.java:
##########
@@ -69,12 +69,20 @@ public boolean producesDictionaryEncodedVector() {
   }
 
   public abstract class BatchReader {
-    public void nextBatch(FieldVector fieldVector, int typeWidth, NullabilityHolder holder) {
+    public void nextBatch(
+        int numValsToRead, FieldVector fieldVector, int typeWidth, NullabilityHolder holder) {
       int rowsReadSoFar = 0;
-      while (rowsReadSoFar < batchSize && hasNext()) {
+      while (rowsReadSoFar < batchSize && hasNext() && rowsReadSoFar < numValsToRead) {
         advance();
+        int expectedBatchSize;
+        if (numValsToRead < 0) {
+          expectedBatchSize = batchSize - rowsReadSoFar;

Review Comment:
   
   [doubt] Considering rowsReadSoFar = 0 and numValsToRead < 0: in this case, will we ever enter the while loop? How are we keeping the existing behaviour intact when the feature is not used?
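   A minimal, self-contained sketch of this concern (assumed values, not code from the PR): with a negative numValsToRead sentinel, the new loop condition is already false before the first iteration, so nothing would be read even when no limit was requested; normalizing the sentinel is one hedged way to preserve the old behaviour.

```java
// Illustrative sketch only; batchSize and the -1 "no limit" sentinel are assumptions.
public class LoopConditionSketch {
  public static void main(String[] args) {
    int batchSize = 4096;
    int rowsReadSoFar = 0;
    int numValsToRead = -1; // assumed sentinel for "no limit pushed down"

    // Mirrors the new while condition (ignoring hasNext()): false on the very first check.
    boolean entersLoop = rowsReadSoFar < batchSize && rowsReadSoFar < numValsToRead;
    System.out.println("enters loop: " + entersLoop); // prints: enters loop: false

    // One possible way to keep the old behaviour: normalize the sentinel before the loop.
    int effectiveLimit = numValsToRead < 0 ? Integer.MAX_VALUE : numValsToRead;
    boolean entersLoopNormalized = rowsReadSoFar < batchSize && rowsReadSoFar < effectiveLimit;
    System.out.println("with normalized limit: " + entersLoopNormalized); // prints: true
  }
}
```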
   



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java:
##########
@@ -407,14 +422,35 @@ public Scan build() {
 
   private Scan buildBatchScan() {
     Schema expectedSchema = schemaWithMetadataColumns();
+    org.apache.iceberg.Scan scan =
+        buildIcebergBatchScan(false /* not include Column Stats */, expectedSchema);
+    if (pushedLimit > 0 && hasDeletes(scan)) {
+      LOG.info("Skipping limit pushdown: detected row level deletes");
+      pushedLimit = -1;
+    }

Review Comment:
   [Q] Can we push down the limit even when some data filters are not completely pushed down to the scan layer?
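   A minimal sketch of why this matters (plain Java streams standing in for the scan and for Spark's post-scan residual filter; the numbers are made up): if the scan truncates its output at the pushed limit but a residual filter is still applied afterwards, fewer rows than requested can survive even though the table holds enough matching rows.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ResidualFilterSketch {
  public static void main(String[] args) {
    int limit = 5;

    // Pretend the table holds rows 0..99 and the limit is applied at the scan layer.
    List<Integer> scanOutput =
        IntStream.range(0, 100).boxed().limit(limit).collect(Collectors.toList());

    // A residual predicate (keep even values) is still evaluated after the scan.
    List<Integer> afterResidualFilter =
        scanOutput.stream().filter(v -> v % 2 == 0).collect(Collectors.toList());

    // Only 3 rows survive although the table contains 50 matching rows.
    System.out.println(afterResidualFilter); // prints: [0, 2, 4]
  }
}
```

   This is why skipping the pushdown (or treating the pushed limit only as an upper bound) is the conservative choice whenever filters are not fully pushed down.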



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

