Re: [PR] Spark : Derive Stats From Manifest on the Fly [iceberg]

via GitHub Tue, 26 Nov 2024 16:02:53 -0800


guykhazma commented on PR #11615:
URL: https://github.com/apache/iceberg/pull/11615#issuecomment-2502221330


   @huaxingao yes, it is possible to reuse the logic from the aggregate 
pushdown by using `AggregateEvaluator` in place of the current code that 
aggregates from the manifests. Something along these lines:
   ```Java
         List<Expression> expressions = table.schema().columns().stream()
                 .map(field -> {
                   String colName = field.name(); // Extract the column name
                   // Create min, max, and non-null count expressions for the column
                   return List.of(
                           Expressions.min(colName),
                           Expressions.max(colName),
                           Expressions.count(colName)
                   );
                 })
                 .flatMap(List::stream) // Flatten the lists into a single stream
                 .collect(Collectors.toList());
   
         AggregateEvaluator aggregateEvaluator =
                 AggregateEvaluator.create(table.schema(), expressions);
         for (FileScanTask task : fileScanTasks) {
           aggregateEvaluator.update(task.file());
         }
   
         // Bail out if any aggregate could not be computed from file metadata
         if (!aggregateEvaluator.allAggregatorsValid()) {
           return;
         }
         // Get the total row count to compute the number of null rows
         long rowsCount =
                 taskGroups().stream().mapToLong(ScanTaskGroup::estimatedRowsCount).sum();
         // Populate the maps with the results; the result struct holds
         // (min, max, non-null count) for each column, in schema order
         StructLike res = aggregateEvaluator.result();
         IntStream.range(0, table.schema().columns().size())
                 .forEach(i -> {
                   int fieldId = table.schema().columns().get(i).fieldId();
                   minValues.put(fieldId, res.get(i * 3, Object.class));
                   maxValues.put(fieldId, res.get(i * 3 + 1, Object.class));
                   nullCounts.put(fieldId, rowsCount - res.get(i * 3 + 2, Long.class));
                 });
   ```
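
To make the `i * 3` indexing in the snippet easier to follow, here is a minimal, self-contained sketch of just the unpacking step. It uses plain Java collections rather than the Iceberg classes: `FlatAggregateLayout`, `ColumnStats`, and `unpack` are hypothetical names introduced for illustration, and a `List<Object>` stands in for the `StructLike` result. It assumes the aggregates are laid out as (min, max, non-null count) per column, in the same order the expressions were built.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlatAggregateLayout {
  // Aggregates requested per column: min, max, non-null count (assumed order).
  static final int AGGS_PER_COLUMN = 3;

  // Hypothetical holder for the per-column stats, keyed by field ID.
  static final class ColumnStats {
    final Map<Integer, Object> minValues = new HashMap<>();
    final Map<Integer, Object> maxValues = new HashMap<>();
    final Map<Integer, Long> nullCounts = new HashMap<>();
  }

  // Unpacks a flat result of the form [min0, max0, count0, min1, max1, count1, ...].
  static ColumnStats unpack(List<Integer> fieldIds, List<Object> flatResult, long rowCount) {
    if (flatResult.size() != fieldIds.size() * AGGS_PER_COLUMN) {
      throw new IllegalArgumentException("Expected 3 aggregates per column");
    }
    ColumnStats stats = new ColumnStats();
    for (int i = 0; i < fieldIds.size(); i++) {
      int base = i * AGGS_PER_COLUMN;
      int fieldId = fieldIds.get(i);
      stats.minValues.put(fieldId, flatResult.get(base));
      stats.maxValues.put(fieldId, flatResult.get(base + 1));
      // count(col) only counts non-null values, so nulls = total rows - non-null count
      long nonNullCount = (Long) flatResult.get(base + 2);
      stats.nullCounts.put(fieldId, rowCount - nonNullCount);
    }
    return stats;
  }

  public static void main(String[] args) {
    // Two columns (field IDs 1 and 2) over 10 rows; column 1 has 8 non-null values.
    List<Object> flat = List.of(10, 99, 8L, "a", "z", 10L);
    ColumnStats stats = unpack(List.of(1, 2), flat, 10L);
    System.out.println(stats.nullCounts.get(1)); // prints 2
  }
}
```

The null count is only as accurate as the row count passed in, so with an estimated row count the derived null counts are estimates as well.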


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

