Re: [PR] Bug-16563: Imposed an ordering on the filter expressions while checking for equality and hashCode of spark scans. [iceberg]

via GitHub Tue, 09 Jun 2026 12:26:10 -0700


huan233usc commented on code in PR #16570:
URL: https://github.com/apache/iceberg/pull/16570#discussion_r3383060137



##########
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/sql/TestFilterPushDown.java:
##########
@@ -674,9 +706,20 @@ private void checkFilters(
       assertThat(planAsString).as("Should be no post scan filter").doesNotContain("Filter (");
     }
 
-    assertThat(planAsString)
-        .as("Pushed filters must match")
-        .contains(", filters=" + icebergFilters + ",");
+    int startIndex = planAsString.indexOf("filters=");
+    int endIndex = planAsString.indexOf("runtimeFilters");
+    String filterStringFromPlan = planAsString.substring(startIndex, endIndex);
+    Arrays.stream(icebergFilters)
+        .forEach(
+            filter -> {
+              assertThat(filterStringFromPlan).as("Pushed filters must contain").contains(filter);

Review Comment:
   Switching to per-filter contains makes these assertions prone to false 
positives via substring matches, e.g. "id = 1" is contained in "id = 10", and 
"dep > 'd3'" would match a plan rendering 'd30'. 
   
   Consider parsing the filters= segment into the actual filter list and 
comparing it (sorted) against the expected set, so both presence and exactness 
are validated.
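
A minimal sketch of that parse-and-compare idea, using only the JDK. The plan layout assumed here (`filters=a, b, runtimeFilters=...`) and the names `PlanFilterCheck`, `parseFilters`, and `sortedList` are hypothetical; the real plan string may delimit filters differently, and a filter that itself contains ", " (an IN list, for example) would need a smarter split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class PlanFilterCheck {
  private PlanFilterCheck() {}

  // Hypothetical helper: takes the text between "filters=" and ", runtimeFilters",
  // splits it on ", " and returns the pieces sorted.
  // Assumption: no single filter contains ", ".
  static List<String> parseFilters(String plan) {
    String startMarker = "filters=";
    int start = plan.indexOf(startMarker);
    int end = start < 0 ? -1 : plan.indexOf(", runtimeFilters", start);
    if (start < 0 || end < 0) {
      throw new IllegalArgumentException("No filters segment in plan: " + plan);
    }
    String segment = plan.substring(start + startMarker.length(), end);
    if (segment.isEmpty()) {
      return List.of();
    }
    return Arrays.stream(segment.split(", ")).sorted().collect(Collectors.toList());
  }

  // Sorts the expected filters so the comparison ignores order.
  static List<String> sortedList(String... expected) {
    return Arrays.stream(expected).sorted().collect(Collectors.toList());
  }

  public static void main(String[] args) {
    String plan = "IcebergScan(table=db.t, filters=id = 10, dep > 'd3', runtimeFilters=)";

    // Substring check: passes although the plan pushed "id = 10", not "id = 1".
    System.out.println(plan.contains("id = 1")); // true

    // Exact check: the parsed list must equal the expected list.
    System.out.println(parseFilters(plan).equals(sortedList("id = 1", "dep > 'd3'"))); // false
    System.out.println(parseFilters(plan).equals(sortedList("dep > 'd3'", "id = 10"))); // true
  }
}
```

Comparing whole lists also catches an unexpected extra filter in the plan, which per-filter `contains` cannot.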



##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -384,4 +389,78 @@ protected long adjustSplitSize(List<? extends ScanTask> tasks, long splitSize) {
       return splitSize;
     }
   }
+
+  protected static String createOrderedExprString(Stream<Expression> exprStream) {
+    return exprStream
+        .flatMap(x -> ExpressionVisitors.visit(x, ExpressionFlattener.INSTANCE).stream())
+        .map(Spark3Util::describe)
+        .sorted()
+        .collect(Collectors.joining(", "));
+  }
+
+  private static class ExpressionFlattener
+      extends ExpressionVisitors.ExpressionVisitor<List<Expression>> {
+
+    private static final ExpressionFlattener INSTANCE = new ExpressionFlattener();
+
+    private ExpressionFlattener() {}
+
+    @Override
+    public List<Expression> alwaysTrue() {
+      return List.of(Expressions.alwaysTrue());
+    }
+
+    @Override
+    public List<Expression> alwaysFalse() {
+      return List.of(Expressions.alwaysFalse());
+    }
+
+    @Override
+    public List<Expression> not(List<Expression> result) {
+      // since its a list of expressions created by And, if more than 1, so it will already be
+      // sorted.
+      return List.of(Expressions.not(mergeExpressions(result)));
+    }
+
+    @Override
+    public List<Expression> and(List<Expression> leftResult, List<Expression> rightResult) {
+      List<Expression> flattened = Lists.newArrayList(leftResult);
+      flattened.addAll(rightResult);
+      // sort the flattened stream back else otherwise the subtree may have ordering issue, when
+      // calculating hashCode
+      return flattened.stream()
+          .map(expr -> Pair.of(Spark3Util.describe(expr), expr))

Review Comment:
   Do we really need the pair here?
   
   Would the following work?
   ```
   return flattened.stream()
       .sorted(Comparator.comparing(Spark3Util::describe))
       .toList();
   ```
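
To illustrate with stand-in types, the sketch below sorts plain strings by a placeholder `describe` function, once through key/value pairs and once with `Comparator.comparing`. `SortByDescription` and its `describe` are hypothetical and are not `Spark3Util.describe`. One trade-off to keep in mind: the comparator form recomputes the key on every comparison, while the pair form computes it once per element.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class SortByDescription {
  private SortByDescription() {}

  // Placeholder for Spark3Util.describe: any deterministic String key works here.
  static String describe(String expr) {
    return expr.trim().toLowerCase();
  }

  // Pair-based form: compute the key once per element, sort by it, then drop it.
  static List<String> sortWithPairs(List<String> exprs) {
    return exprs.stream()
        .map(e -> Map.entry(describe(e), e))
        .sorted(Map.Entry.comparingByKey())
        .map(Map.Entry::getValue)
        .collect(Collectors.toList());
  }

  // Comparator-based form: the key is derived inside the comparator.
  static List<String> sortWithComparator(List<String> exprs) {
    return exprs.stream()
        .sorted(Comparator.comparing(SortByDescription::describe))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> exprs = List.of("id = 10", "dep > 'd3'", "Id = 1");
    // Both forms are stable sorts on the same key, so the results match.
    System.out.println(sortWithPairs(exprs).equals(sortWithComparator(exprs))); // true
  }
}
```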



##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkRuntimeFilterableScan.java:
##########
@@ -195,6 +195,6 @@ private Expression convertRuntimePredicates(Predicate[] predicates) {
   }
 
   protected String runtimeFiltersDesc() {
-    return Spark3Util.describe(runtimeFilters);
+    return createOrderedExprString(runtimeFilters.stream());

Review Comment:
   Let's keep them separate. Normalization (flatten + sort) is only needed for 
the equals/hashCode contract that enables exchange reuse — that's the scope of 
this bug.
   
   Changing the user-facing EXPLAIN output (flattening top-level ANDs + 
reordering) is a separate behavior change I'd rather not bundle into a bugfix, 
since downstream tooling / golden-file tests may parse the plan. So:
   - keep filtersDesc() / runtimeFiltersDesc() (used by description()) as 
Spark3Util.describe(...),
   - add a separate canonical method used only by equals/hashCode.
   
   Making EXPLAIN deterministic can be a follow-up if we decide we want it.
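
A rough sketch of the split I have in mind, with plain strings standing in for Iceberg expressions. `ScanSketch` and `canonicalFilters` are hypothetical names, and the real canonical form would also flatten ANDs as this PR already does; the point is only that `description()` keeps the original order while `equals`/`hashCode` use a sorted view.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ScanSketch {
  // Stand-in for the scan's filter expressions, kept in their original order.
  private final List<String> filters;

  ScanSketch(List<String> filters) {
    this.filters = List.copyOf(filters);
  }

  // User-facing text: original order is preserved, so EXPLAIN output does not change.
  String description() {
    return "filters=" + String.join(", ", filters);
  }

  // Canonical form used only by equals/hashCode: insensitive to filter order.
  private List<String> canonicalFilters() {
    return filters.stream().sorted().collect(Collectors.toList());
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof ScanSketch)) {
      return false;
    }
    return canonicalFilters().equals(((ScanSketch) other).canonicalFilters());
  }

  @Override
  public int hashCode() {
    return canonicalFilters().hashCode();
  }

  public static void main(String[] args) {
    ScanSketch a = new ScanSketch(List.of("id = 1", "dep = 'x'"));
    ScanSketch b = new ScanSketch(List.of("dep = 'x'", "id = 1"));
    System.out.println(a.equals(b)); // true
    System.out.println(a.hashCode() == b.hashCode()); // true
    System.out.println(a.description().equals(b.description())); // false
  }
}
```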



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

