count to iceberg

via GitHub Thu, 02 Feb 2023 12:15:32 -0800


huaxingao commented on code in PR #6622:
URL: https://github.com/apache/iceberg/pull/6622#discussion_r1095044926



##########
api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java:
##########
@@ -44,4 +57,85 @@ public Type type() {
       return term().type();
     }
   }
+
+  public String columnName() {
+    if (op() == Operation.COUNT_STAR) {
+      return "*";
+    } else {
+      return ref().name();
+    }
+  }
+
+  public String describe() {
+    switch (op()) {
+      case COUNT_STAR:
+        return "count(*)";
+      case COUNT:
+        return "count(" + ExpressionUtil.describe(term()) + ")";
+      case MAX:
+        return "max(" + ExpressionUtil.describe(term()) + ")";
+      case MIN:
+        return "min(" + ExpressionUtil.describe(term()) + ")";
+      default:
+        throw new UnsupportedOperationException("Unsupported aggregate type: " 
+ op());
+    }
+  }
+
+  <V> V safeGet(Map<Integer, V> map, int key) {
+    return safeGet(map, key, null);
+  }
+
+  <V> V safeGet(Map<Integer, V> map, int key, V defaultValue) {
+    if (map != null) {
+      return map.getOrDefault(key, defaultValue);
+    }
+
+    return null;
+  }
+
+  interface Aggregator<R> {
+    void update(StructLike struct);
+
+    void update(DataFile file);
+
+    R result();
+  }
+
+  abstract static class NullSafeAggregator<T, R> implements Aggregator<R> {
+    private final BoundAggregate<T, R> aggregate;
+    private boolean isNull = false;
+
+    NullSafeAggregator(BoundAggregate<T, R> aggregate) {
+      this.aggregate = aggregate;
+    }
+
+    protected abstract void update(R value);
+
+    protected abstract R current();
+
+    @Override
+    public void update(StructLike struct) {
+      if (!isNull) {
+        R value = aggregate.eval(struct);
+        update(value);
+      }
+    }
+
+    @Override
+    public void update(DataFile file) {
+      if (!isNull) {
+        R value = aggregate.eval(file);
+        update(value);

Review Comment:
   If one data file  evaluates `null`, I think we still want to evaluate the 
rest of the data files. For example,
   ```
   CREATE TABLE test (id LONG, data INT) USING iceberg PARTITIONED BY (id);
   
   INSERT INTO TABLE test VALUES (1, null), (1, null), (2, 33), (2, 44), (3, 
55), (3, 66);
   
   SELECT max(data) FROM test;
   ```
   For `max(data)`, the first data file evaluates null, I think we still want 
to evaluate the rest of the data files to get the max value `66` for 
`max(data)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] huaxingao commented on a diff in pull request #6622: push down min/max/count to iceberg

Reply via email to