aokolnychyi commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1298831695
##########
api/src/main/java/org/apache/iceberg/BaseScanTaskGroup.java:
##########
@@ -48,20 +54,59 @@ public StructLike groupingKey() {
@Override
@SuppressWarnings("unchecked")
public Collection<T> tasks() {
- if (taskList == null) {
+ if (taskCollection == null) {
synchronized (this) {
- if (taskList == null) {
+ if (taskCollection == null) {
ImmutableList.Builder<T> listBuilder =
ImmutableList.builderWithExpectedSize(tasks.length);
for (Object task : tasks) {
listBuilder.add((T) task);
}
- taskList = listBuilder.build();
+ this.taskCollection = listBuilder.build();
}
}
}
- return taskList;
+ return taskCollection;
+ }
+
+ @Override
+ public long sizeBytes() {
+ if (sizeBytes == Long.MIN_VALUE) {
+ long size = 0L;
+ for (Object task : tasks) {
+ size += ((ScanTask) task).sizeBytes();
+ }
+ this.sizeBytes = size;
+ }
+
+ return sizeBytes;
+ }
+
+ @Override
+ public long estimatedRowsCount() {
Review Comment:
Caching isn't my primary goal. When profiling distributed planning, I
noticed we generate tons of garbage while planning task groups and it sometimes
takes up to 2/3 of the planning time to just plan groups for full table scans
with millions of files. My primary motivation is to iterate over the array of
tasks, instead of using the parent implementation with `LongStream` (which is
slow and generate many unnecessary objects) or using an iterator-based
approach). For scans with 10+ million files, this overhead adds up, especially
when we are running low on memory.
Internally, we did have a cache of tasks groups that were reused in multiple
Spark scans. These metrics are being used for reporting stats to engines so
while caching isn't the primary goal, it seems simple enough to do it and may
be helpful if we also decide to cache task groups in the future.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]