advancedxy commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1299213591
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -45,31 +50,67 @@ protected FileScanTask self() {
@Override
protected FileScanTask newSplitTask(FileScanTask parentTask, long offset,
long length) {
- return new SplitScanTask(offset, length, parentTask);
+ return new SplitScanTask(offset, length, deletesSizeBytes(), parentTask);
}
@Override
public List<DeleteFile> deletes() {
- return ImmutableList.copyOf(deletes);
+ if (deletesAsList == null) {
+ this.deletesAsList =
Collections.unmodifiableList(Arrays.asList(deletes));
+ }
+
+ return deletesAsList;
+ }
+
+ @Override
+ public long sizeBytes() {
+ return length() + deletesSizeBytes();
+ }
+
+ @Override
+ public int filesCount() {
+ return 1 + deletes.length;
Review Comment:
Nit: If more methods are added to the parent class later, how can we make
sure they are overridden in this class? Otherwise, a new method would
probably end up materializing `deletesAsList` by accident.
I don't think this is a blocker, but it would be great to have some way
(or tests) to detect that.
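One possible way to detect this is a reflection-based test that fails when the parent class gains a public method the subclass does not redeclare. The following is only a minimal sketch: `ParentTask`, `ChildTask`, and `missingOverrides` are hypothetical names invented for illustration and are not part of Iceberg.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OverrideCheck {
  // Hypothetical stand-in for the parent class; not Iceberg's real type.
  public static class ParentTask {
    public long sizeBytes() { return 0L; }
    public int filesCount() { return 1; }
    public long estimatedRowsCount() { return 0L; }
  }

  // Hypothetical child that forgot to override estimatedRowsCount().
  public static class ChildTask extends ParentTask {
    @Override public long sizeBytes() { return 42L; }
    @Override public int filesCount() { return 2; }
  }

  // Returns the names of overridable public methods declared in `parent`
  // that `child` does not redeclare with the same parameter types.
  public static List<String> missingOverrides(Class<?> parent, Class<?> child) {
    List<String> missing = new ArrayList<>();
    for (Method method : parent.getDeclaredMethods()) {
      int mods = method.getModifiers();
      boolean overridable = Modifier.isPublic(mods)
          && !Modifier.isStatic(mods)
          && !Modifier.isFinal(mods)
          && !method.isSynthetic();
      if (!overridable) {
        continue;
      }
      try {
        child.getDeclaredMethod(method.getName(), method.getParameterTypes());
      } catch (NoSuchMethodException e) {
        missing.add(method.getName());
      }
    }
    Collections.sort(missing); // getDeclaredMethods() has no guaranteed order
    return missing;
  }

  public static void main(String[] args) {
    System.out.println(missingOverrides(ParentTask.class, ChildTask.class));
  }
}
```

A unit test could assert that the returned list is empty for the real class pair, so adding a parent method without a matching override would break the build rather than go unnoticed.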
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -45,31 +49,67 @@ protected FileScanTask self() {
@Override
protected FileScanTask newSplitTask(FileScanTask parentTask, long offset,
long length) {
- return new SplitScanTask(offset, length, parentTask);
+ return new SplitScanTask(offset, length, deletesSizeBytes(), parentTask);
}
@Override
public List<DeleteFile> deletes() {
- return ImmutableList.copyOf(deletes);
+ if (deletesAsList == null) {
+ this.deletesAsList = ImmutableList.copyOf(deletes);
+ }
+
+ return deletesAsList;
+ }
+
+ @Override
+ public long sizeBytes() {
+ return length() + deletesSizeBytes();
+ }
+
+ @Override
+ public int filesCount() {
+ return 1 + deletes.length;
}
@Override
public Schema schema() {
return super.schema();
}
+ private long deletesSizeBytes() {
+ if (deletesSizeBytes == null) {
Review Comment:
8 bytes (size of a long) * 1_000_000 (1 million) = ~8 MB. I wouldn't worry
too much about this, especially since the tasks are serialized to multiple
executors in multiple rounds (in the Spark query engine).
However, it does add unnecessary overhead for a ScanTask without delete
files, so a transient long with lazy calculation would be nice.
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -28,6 +28,10 @@ public class BaseFileScanTask extends
BaseContentScanTask<FileScanTask, DataFile
implements FileScanTask {
private final DeleteFile[] deletes;
+ // lazy variables
+ private transient volatile List<DeleteFile> deletesAsList = null;
+ private transient volatile Long deletesSizeBytes = null;
Review Comment:
Thanks for the detailed explanation.
On second thought, how about declaring it as a normal transient long, such
as:
```java
private transient volatile long deletesSizeBytes = 0;

private long deletesSizeBytes() {
  // 0 means deletesSizeBytes might not be initialized yet
  if (deletesSizeBytes == 0) {
    long size = 0L;
    for (DeleteFile deleteFile : deletes) {
      size += deleteFile.fileSizeInBytes();
    }
    this.deletesSizeBytes = size;
  }
  return deletesSizeBytes;
}
```
We just pay a small additional cost when there are no delete files: each
call iterates over an empty array.
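To illustrate the trade-off, here is a small self-contained sketch of the same pattern. It assumes plain Java serialization, and `LazySizeDemo`, `Task`, `roundTrip`, and the `computations` counter are hypothetical names used only for this example; an array of sizes stands in for the real `DeleteFile` objects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class LazySizeDemo {
  // Hypothetical stand-in for a scan task.
  public static class Task implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long[] deleteSizes;
    // Not serialized; 0 doubles as "not computed yet".
    private transient volatile long deletesSizeBytes = 0;
    // Counts how often the loop ran; for illustration only.
    private transient int computations = 0;

    public Task(long... deleteSizes) {
      this.deleteSizes = deleteSizes;
    }

    public long deletesSizeBytes() {
      if (deletesSizeBytes == 0) {
        computations++;
        long size = 0L;
        for (long deleteSize : deleteSizes) {
          size += deleteSize;
        }
        this.deletesSizeBytes = size;
      }
      return deletesSizeBytes;
    }

    public int computations() {
      return computations;
    }
  }

  // Serializes and deserializes a task in memory.
  public static Task roundTrip(Task task) throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(task);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return (Task) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    Task copy = roundTrip(new Task(10L, 20L));
    System.out.println(copy.deletesSizeBytes() + " bytes after round trip");

    Task noDeletes = new Task();
    noDeletes.deletesSizeBytes();
    noDeletes.deletesSizeBytes();
    System.out.println(noDeletes.computations() + " recomputations without deletes");
  }
}
```

In this sketch a task with delete files computes the sum once per deserialized instance, while a task without delete files re-enters the (empty) loop on every call because the cached value is indistinguishable from the sentinel.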
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.