Re: [PR] [opt](multi-catalog) Optimize file split size. [doris]

via GitHub Thu, 25 Dec 2025 07:22:18 -0800


morningman commented on code in PR #58858:
URL: https://github.com/apache/doris/pull/58858#discussion_r2647114335



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/source/IcebergScanNode.java:
##########
@@ -359,8 +360,44 @@ public TableScan createTableScan() throws UserException {
     }
 
     private CloseableIterable<FileScanTask> planFileScanTask(TableScan scan) {
-        long targetSplitSize = getRealFileSplitSize(0);
-        return TableScanUtil.splitFiles(scan.planFiles(), targetSplitSize);
+        if (sessionVariable.getFileSplitSize() > 0) {
+            return TableScanUtil.splitFiles(scan.planFiles(),
+                    sessionVariable.getFileSplitSize());
+        }
+        if (isBatchMode()) {
+            // Currently iceberg batch split mode will use max split size.
+            // TODO: dynamic split size in batch split mode need to customize 
iceberg splitter.
+            return TableScanUtil.splitFiles(scan.planFiles(), 
sessionVariable.getMaxSplitSize());
+        }
+
+        // Non Batch Mode
+        // Materialize planFiles() into a list to avoid iterating the 
CloseableIterable twice.
+        // It will cost memory if the table is large.
+        List<FileScanTask> fileScanTaskList = new ArrayList<>();
+        try (CloseableIterable<FileScanTask> scanTasksIter = scan.planFiles()) 
{
+            for (FileScanTask task : scanTasksIter) {
+                fileScanTaskList.add(task);
+            }
+        } catch (Exception e) {
+            throw new RuntimeException("Failed to materialize file scan 
tasks", e);
+        }
+
+        targetSplitSize = determineTargetFileSplitSize(fileScanTaskList);
+        return 
TableScanUtil.splitFiles(CloseableIterable.withNoopClose(fileScanTaskList), 
targetSplitSize);
+    }
+
+    private long determineTargetFileSplitSize(Iterable<FileScanTask> tasks) {
+        long result = sessionVariable.getMaxInitialSplitSize();
+        long accumulatedTotalFileSize = 0;
+        for (FileScanTask task : tasks) {

Review Comment:
   Should we extract a common method for both hive and iceberg table?
   This logic looks just same as method in hive scan node.



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/tvf/source/TVFScanNode.java:
##########
@@ -161,6 +167,23 @@ public List<Split> getSplits(int numBackends) throws 
UserException {
         return splits;
     }
 
+    private long determineTargetFileSplitSize(List<TBrokerFileStatus> 
fileStatuses) {
+        if (sessionVariable.getFileSplitSize() > 0) {
+            return sessionVariable.getFileSplitSize();
+        }
+        long result = sessionVariable.getMaxInitialSplitSize();
+        long totalFileSize = 0;
+        for (TBrokerFileStatus fileStatus : fileStatuses) {

Review Comment:
   same suggestion



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/paimon/source/PaimonScanNode.java:
##########
@@ -384,12 +388,43 @@ public List<Split> getSplits(int numBackends) throws 
UserException {
 
         // We need to set the target size for all splits so that we can 
calculate the
         // proportion of each split later.
-        splits.forEach(s -> s.setTargetSplitSize(realFileSplitSize));
+        splits.forEach(s -> 
s.setTargetSplitSize(sessionVariable.getFileSplitSize() > 0
+                ? sessionVariable.getFileSplitSize() : 
sessionVariable.getMaxSplitSize()));
 
         this.selectedPartitionNum = partitionInfoMaps.size();
         return splits;
     }
 
+    private long determineTargetFileSplitSize(List<DataSplit> dataSplits,
+            boolean isBatchMode) {
+        if (sessionVariable.getFileSplitSize() > 0) {
+            return sessionVariable.getFileSplitSize();
+        }
+        /** Paimon batch split mode will return 0. and 
<code>FileSplitter</code>
+         *  will determine file split size.
+         */
+        if (isBatchMode) {
+            return 0;
+        }
+        long result = sessionVariable.getMaxInitialSplitSize();
+        long totalFileSize = 0;
+        for (DataSplit dataSplit : dataSplits) {

Review Comment:
   Same suggestion



##########
fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java:
##########
@@ -2173,6 +2177,15 @@ public boolean isEnableHboNonStrictMatchingMode() {
     @VariableMgr.VarAttr(name = FILE_SPLIT_SIZE, needForward = true)
     public long fileSplitSize = 0;
 
+    @VariableMgr.VarAttr(name = MAX_INITIAL_FILE_SPLIT_SIZE, needForward = 
true)

Review Comment:
   Add description field



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [opt](multi-catalog) Optimize file split size. [doris]

Reply via email to