[GitHub] [iceberg] rdblue commented on a diff in pull request #7886: Spark 3.4: Support pushing down system functions by V2 filters

via GitHub Wed, 05 Jul 2023 13:06:12 -0700


rdblue commented on code in PR #7886:
URL: https://github.com/apache/iceberg/pull/7886#discussion_r1253593455



##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkV2Filters.java:
##########
@@ -360,10 +427,96 @@ private static boolean hasNoInFilter(Predicate predicate) 
{
   }
 
   private static boolean isSupportedInPredicate(Predicate predicate) {
-    if (!isRef(childAtIndex(predicate, 0))) {
+    if (!couldConvert(childAtIndex(predicate, 0))) {
       return false;
     } else {
       return 
Arrays.stream(predicate.children()).skip(1).allMatch(SparkV2Filters::isLiteral);
     }
   }
+
+  private static <T> UnboundTerm<Object> toTerm(T input) {
+    if (input instanceof NamedReference) {
+      return Expressions.ref(SparkUtil.toColumnName((NamedReference) input));
+
+    } else if (input instanceof UserDefinedScalarFunc) {
+      return udfToTerm((UserDefinedScalarFunc) input);
+
+    } else {
+
+      return null;
+    }
+  }
+
+  @VisibleForTesting
+  @SuppressWarnings("unchecked")
+  static UnboundTerm<Object> udfToTerm(UserDefinedScalarFunc udf) {
+    switch (udf.name().toLowerCase(Locale.ROOT)) {
+      case "years":
+        Preconditions.checkArgument(
+            udf.children().length == 1, "years function should have only one 
children (column)");
+        if (isRef(udf.children()[0])) {
+          return year(SparkUtil.toColumnName((NamedReference) 
udf.children()[0]));
+        }
+
+        return null;
+      case "months":
+        Preconditions.checkArgument(
+            udf.children().length == 1, "months function should have only one 
children (column)");
+        if (isRef(udf.children()[0])) {
+          return month(SparkUtil.toColumnName((NamedReference) 
udf.children()[0]));
+        }
+
+        return null;
+      case "days":
+        Preconditions.checkArgument(
+            udf.children().length == 1, "days function should have only one 
children (column)");
+        if (isRef(udf.children()[0])) {
+          return day(SparkUtil.toColumnName((NamedReference) 
udf.children()[0]));
+        }
+
+        return null;
+      case "hours":
+        Preconditions.checkArgument(
+            udf.children().length == 1, "hours function should have only one 
children (colum)");
+        if (isRef(udf.children()[0])) {
+          return hour(SparkUtil.toColumnName((NamedReference) 
udf.children()[0]));
+        }
+
+        return null;
+      case "bucket":
+        Preconditions.checkArgument(
+            udf.children().length == 2,
+            "bucket function should have two children (numBuckets and 
column)");
+        if (isLiteral(udf.children()[0]) && isRef(udf.children()[1])) {
+          int numBuckets = (Integer) convertLiteral((Literal<?>) 
udf.children()[0]);
+          return bucket(SparkUtil.toColumnName((NamedReference) 
udf.children()[1]), numBuckets);
+        }
+
+        return null;
+      case "truncate":
+        Preconditions.checkArgument(
+            udf.children().length == 2,
+            "truncate function should have two children (width and column)");
+        if (isLiteral(udf.children()[0]) && isRef(udf.children()[1])) {
+          int width = (Integer) convertLiteral((Literal<?>) udf.children()[0]);
+          return truncate(SparkUtil.toColumnName((NamedReference) 
udf.children()[1]), width);
+        }
+
+        return null;
+      default:
+        return null;
+    }
+  }
+
+  private static class PredicateChildren {
+    private final UnboundTerm<Object> term;
+    private final Object value;
+    private final boolean termOnLeft;

Review Comment:
   That's the problem with the method. It only works for equality.
   
   In general, I'm not a big fan of the refactoring happening in this because I 
think it makes it much harder to review. Can you avoid changing as much as 
possible?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on a diff in pull request #7886: Spark 3.4: Support pushing down system functions by V2 filters

Reply via email to