[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #8346: Core: Lazily cache user-facing state in BaseFileScanTask

via GitHub Fri, 18 Aug 2023 13:13:08 -0700


aokolnychyi commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1298831695



##########
api/src/main/java/org/apache/iceberg/BaseScanTaskGroup.java:
##########
@@ -48,20 +54,59 @@ public StructLike groupingKey() {
   @Override
   @SuppressWarnings("unchecked")
   public Collection<T> tasks() {
-    if (taskList == null) {
+    if (taskCollection == null) {
       synchronized (this) {
-        if (taskList == null) {
+        if (taskCollection == null) {
           ImmutableList.Builder<T> listBuilder =
               ImmutableList.builderWithExpectedSize(tasks.length);
           for (Object task : tasks) {
             listBuilder.add((T) task);
           }
-          taskList = listBuilder.build();
+          this.taskCollection = listBuilder.build();
         }
       }
     }
 
-    return taskList;
+    return taskCollection;
+  }
+
+  @Override
+  public long sizeBytes() {
+    if (sizeBytes == Long.MIN_VALUE) {
+      long size = 0L;
+      for (Object task : tasks) {
+        size += ((ScanTask) task).sizeBytes();
+      }
+      this.sizeBytes = size;
+    }
+
+    return sizeBytes;
+  }
+
+  @Override
+  public long estimatedRowsCount() {

Review Comment:
   Caching isn't my primary goal. When profiling distributed planning, I 
noticed we generate tons of garbage while planning task groups, and for full 
table scans with millions of files, planning the groups alone sometimes takes 
up to 2/3 of the planning time. My primary motivation is to iterate over the 
array of tasks directly, instead of using the parent implementation with 
`LongStream` (which is slow and generates many unnecessary objects) or an 
iterator-based approach. For scans with 10+ million files, this overhead adds 
up, especially when we are running low on memory.
   
   Internally, we did have a cache of task groups that were reused in multiple 
Spark scans. These metrics are used for reporting stats to engines, so 
while caching isn't the primary goal, it seems simple enough to do and may 
be helpful if we also decide to cache task groups in the future.
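
   To make the array-iteration point concrete, here is a minimal sketch 
contrasting a stream-based sum with a plain loop over the backing array. It 
is an illustration only, not the PR's code: `SizeSumSketch`, `SizedTask`, 
and `sized` are hypothetical names, and `SizedTask` merely stands in for 
`ScanTask` by modeling `sizeBytes()`.

```java
import java.util.Arrays;

public class SizeSumSketch {
  // Hypothetical stand-in for ScanTask; only sizeBytes() is modeled here.
  interface SizedTask {
    long sizeBytes();
  }

  // Hypothetical helper that builds a task of a fixed size.
  static SizedTask sized(long bytes) {
    return () -> bytes;
  }

  // Stream-based sum: concise, but builds a stream pipeline on every call.
  static long sumWithStream(Object[] tasks) {
    return Arrays.stream(tasks).mapToLong(task -> ((SizedTask) task).sizeBytes()).sum();
  }

  // Plain loop over the array: no stream or iterator objects are created.
  static long sumWithLoop(Object[] tasks) {
    long size = 0L;
    for (Object task : tasks) {
      size += ((SizedTask) task).sizeBytes();
    }
    return size;
  }

  public static void main(String[] args) {
    Object[] tasks = new Object[] {sized(10L), sized(20L), sized(30L)};
    System.out.println(sumWithStream(tasks)); // prints 60
    System.out.println(sumWithLoop(tasks)); // prints 60
  }
}
```

   Both methods return the same total; the difference is only in how many 
intermediate objects each call creates, which is what matters when this runs 
once per task group across millions of files.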


