huan233usc commented on code in PR #16570:
URL: https://github.com/apache/iceberg/pull/16570#discussion_r3383060137
##########
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/sql/TestFilterPushDown.java:
##########
@@ -674,9 +706,20 @@ private void checkFilters(
assertThat(planAsString).as("Should be no post scan filter").doesNotContain("Filter (");
}
- assertThat(planAsString)
- .as("Pushed filters must match")
- .contains(", filters=" + icebergFilters + ",");
+ int startIndex = planAsString.indexOf("filters=");
+ int endIndex = planAsString.indexOf("runtimeFilters");
+ String filterStringFromPlan = planAsString.substring(startIndex, endIndex);
+ Arrays.stream(icebergFilters)
+ .forEach(
+ filter -> {
+ assertThat(filterStringFromPlan).as("Pushed filters must contain").contains(filter);
Review Comment:
Switching to per-filter contains makes these assertions prone to false
positives via substring matches, e.g. "id = 1" is contained in "id = 10", and
"dep > 'd3'" would match a plan rendering 'd30'.
Consider parsing the filters= segment into the actual filter list and
comparing it (sorted) against the expected set, so both presence and exactness
are validated.
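A minimal sketch of the parse-and-compare idea, for illustration only. It assumes the plan renders the pushed filters as `, filters=<f1>, <f2>, runtimeFilters=...` and that no single filter contains `", "` (an `IN (1, 2)` list would break the naive split and needs a real tokenizer). `PushedFilterCheck`, `parsePushedFilters`, and `sortedCopy` are hypothetical names, not part of the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class PushedFilterCheck {
  private PushedFilterCheck() {}

  // Assumed plan layout: "..., filters=<f1>, <f2>, runtimeFilters=...".
  // Returns the individual filter strings in sorted order.
  static List<String> parsePushedFilters(String plan) {
    String startMarker = ", filters=";
    String endMarker = ", runtimeFilters=";

    int start = plan.indexOf(startMarker);
    if (start < 0) {
      throw new IllegalArgumentException("No filters= segment in plan");
    }

    int from = start + startMarker.length();
    int end = plan.indexOf(endMarker, from);
    if (end < 0) {
      throw new IllegalArgumentException("No runtimeFilters= segment after filters=");
    }

    String segment = plan.substring(from, end).trim();
    if (segment.isEmpty()) {
      return List.of();
    }

    // Naive split: assumes ", " never occurs inside a single filter.
    return Arrays.stream(segment.split(", "))
        .map(String::trim)
        .sorted()
        .collect(Collectors.toList());
  }

  // Sorts the expected filters so the comparison is order-insensitive.
  static List<String> sortedCopy(String... expected) {
    return Arrays.stream(expected).sorted().collect(Collectors.toList());
  }
}
```

In the test this could become `assertThat(PushedFilterCheck.parsePushedFilters(planAsString)).isEqualTo(PushedFilterCheck.sortedCopy(icebergFilters))`. With a plan that pushes `id = 10`, a plain `contains("id = 1")` passes, while the list comparison fails as intended.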
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -384,4 +389,78 @@ protected long adjustSplitSize(List<? extends ScanTask> tasks, long splitSize) {
return splitSize;
}
}
+
+ protected static String createOrderedExprString(Stream<Expression> exprStream) {
+ return exprStream
+ .flatMap(x -> ExpressionVisitors.visit(x, ExpressionFlattener.INSTANCE).stream())
+ .map(Spark3Util::describe)
+ .sorted()
+ .collect(Collectors.joining(", "));
+ }
+
+ private static class ExpressionFlattener
+ extends ExpressionVisitors.ExpressionVisitor<List<Expression>> {
+
+ private static final ExpressionFlattener INSTANCE = new ExpressionFlattener();
+
+ private ExpressionFlattener() {}
+
+ @Override
+ public List<Expression> alwaysTrue() {
+ return List.of(Expressions.alwaysTrue());
+ }
+
+ @Override
+ public List<Expression> alwaysFalse() {
+ return List.of(Expressions.alwaysFalse());
+ }
+
+ @Override
+ public List<Expression> not(List<Expression> result) {
+ // since its a list of expressions created by And, if more than 1, so it will already be
+ // sorted.
+ return List.of(Expressions.not(mergeExpressions(result)));
+ }
+
+ @Override
+ public List<Expression> and(List<Expression> leftResult, List<Expression> rightResult) {
+ List<Expression> flattened = Lists.newArrayList(leftResult);
+ flattened.addAll(rightResult);
+ // sort the flattened stream back else otherwise the subtree may have ordering issue, when
+ // calculating hashCode
+ return flattened.stream()
+ .map(expr -> Pair.of(Spark3Util.describe(expr), expr))
Review Comment:
Do we really need the pair here?
Would the following work?
```
return flattened.stream()
.sorted(Comparator.comparing(Spark3Util::describe))
.toList();
```
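For comparison, here is a self-contained sketch of both approaches. The `describe` parameter stands in for `Spark3Util::describe`, and `SortByDescription` is a hypothetical class. Both produce the same order. The practical difference is that `Comparator.comparing` calls the key function again on every comparison, whereas the pair-based version computes each key once, which only matters if `describe` is expensive. `Collectors.toList()` is used here so the sketch compiles on older JDKs; `Stream.toList()` requires Java 16+ and returns an unmodifiable list.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class SortByDescription {
  private SortByDescription() {}

  // Sorts by the described form; describe is re-evaluated during comparisons.
  static <T> List<T> sortedByKey(List<T> items, Function<T, String> describe) {
    return items.stream()
        .sorted(Comparator.comparing(describe))
        .collect(Collectors.toList());
  }

  // Same ordering, but each key is computed once and carried in a (key, item) pair.
  static <T> List<T> sortedWithCachedKey(List<T> items, Function<T, String> describe) {
    return items.stream()
        .map(item -> Map.entry(describe.apply(item), item))
        .sorted(Map.Entry.comparingByKey())
        .map(Map.Entry::getValue)
        .collect(Collectors.toList());
  }
}
```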
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkRuntimeFilterableScan.java:
##########
@@ -195,6 +195,6 @@ private Expression convertRuntimePredicates(Predicate[] predicates) {
}
protected String runtimeFiltersDesc() {
- return Spark3Util.describe(runtimeFilters);
+ return createOrderedExprString(runtimeFilters.stream());
Review Comment:
Let's keep them separate. Normalization (flatten + sort) is only needed for
the equals/hashCode contract that enables exchange reuse — that's the scope of
this bug.
Changing the user-facing EXPLAIN output (flattening top-level ANDs +
reordering) is a separate behavior change I'd rather not bundle into a bugfix,
since downstream tooling / golden-file tests may parse the plan. So:
- keep filtersDesc() / runtimeFiltersDesc() (used by description()) as
Spark3Util.describe(...),
- add a separate canonical method used only by equals/hashCode.
Making EXPLAIN deterministic can be a follow-up if we decide we want it.
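A rough sketch of the separation I have in mind, with plain strings standing in for rendered filter expressions. `ScanSketch` and `canonicalFilters()` are hypothetical names; the real canonical form would also flatten ANDs, which is omitted here.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ScanSketch {
  // Stand-in for the rendered filter expressions of a scan.
  private final List<String> filters;

  ScanSketch(List<String> filters) {
    this.filters = List.copyOf(filters);
  }

  // User-facing output: keeps the filters in their original order.
  String description() {
    return "filters=" + String.join(", ", filters);
  }

  // Order-insensitive form, used only by equals/hashCode.
  private String canonicalFilters() {
    return filters.stream().sorted().collect(Collectors.joining(", "));
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof ScanSketch)) {
      return false;
    }
    return canonicalFilters().equals(((ScanSketch) other).canonicalFilters());
  }

  @Override
  public int hashCode() {
    return canonicalFilters().hashCode();
  }
}
```

Two scans whose filters differ only in order then compare equal and hash identically, while `description()` still shows each scan's filters exactly as before.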
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]