advancedxy commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1300790470
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -45,31 +49,67 @@ protected FileScanTask self() {
@Override
protected FileScanTask newSplitTask(FileScanTask parentTask, long offset,
long length) {
- return new SplitScanTask(offset, length, parentTask);
+ return new SplitScanTask(offset, length, deletesSizeBytes(), parentTask);
}
@Override
public List<DeleteFile> deletes() {
- return ImmutableList.copyOf(deletes);
+ if (deletesAsList == null) {
+ this.deletesAsList = ImmutableList.copyOf(deletes);
+ }
+
+ return deletesAsList;
+ }
+
+ @Override
+ public long sizeBytes() {
+ return length() + deletesSizeBytes();
+ }
+
+ @Override
+ public int filesCount() {
+ return 1 + deletes.length;
}
@Override
public Schema schema() {
return super.schema();
}
+ private long deletesSizeBytes() {
+ if (deletesSizeBytes == null) {
Review Comment:
Just checked the Spark code (Flink probably couldn't handle that scale); it
seems I was misunderstanding how tasks are serialized. I used to think only the
specific task for one partition is serialized and sent to the executor.
It turns out the RDD is serialized as a whole into the task binary for each
task and sent to the executor. That definitely adds a lot of overhead even for
one extra field.
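
For illustration, here is a minimal standalone sketch of the lazy-caching pattern the diff above uses: the combined delete-file size is computed at most once and then reused by `sizeBytes()` and by split tasks. The class and field names (`ScanTaskSketch`, `deleteFileSizes`) are hypothetical stand-ins, not the actual Iceberg API.

```java
import java.util.Arrays;

// Hypothetical stand-in for BaseFileScanTask's caching of the combined
// delete-file size. Not the real Iceberg class; a sketch of the pattern only.
public class ScanTaskSketch {
  private final long[] deleteFileSizes; // assumed: sizes of attached delete files
  private Long deletesSizeBytes = null; // lazily computed, cached after first use

  public ScanTaskSketch(long... deleteFileSizes) {
    this.deleteFileSizes = deleteFileSizes;
  }

  // Computed at most once; subsequent calls (and split tasks handed the
  // parent's value) avoid re-summing the delete files.
  public long deletesSizeBytes() {
    if (deletesSizeBytes == null) {
      this.deletesSizeBytes = Arrays.stream(deleteFileSizes).sum();
    }
    return deletesSizeBytes;
  }

  // Mirrors the diff's sizeBytes(): data length plus delete-file bytes.
  public long sizeBytes(long dataLength) {
    return dataLength + deletesSizeBytes();
  }

  public static void main(String[] args) {
    ScanTaskSketch task = new ScanTaskSketch(10, 20);
    System.out.println(task.deletesSizeBytes()); // 30
    System.out.println(task.sizeBytes(100));     // 130
  }
}
```

The boxed `Long` doubles as the "not yet computed" sentinel, which is why the diff checks `deletesSizeBytes == null` rather than comparing against zero.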
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]