Re: [PR] chore: add timing logs for file index partition and file listing [hudi]

via GitHub Wed, 08 Apr 2026 20:10:02 -0700


yihua commented on code in PR #18417:
URL: https://github.com/apache/hudi/pull/18417#discussion_r3055149157



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -385,7 +394,13 @@ protected List<PartitionPath> 
listPartitionPaths(List<String> relativePartitionP
           HoodieTimeline timelineToQuery = findInstantsInRange();
           matchedPartitionPaths = 
TimelineUtils.getWrittenPartitions(timelineToQuery);
         } else {

Review Comment:
   🤖 Good observation. `TimelineUtils.getWrittenPartitions` just scans the 
already-loaded in-memory timeline — there's no I/O or MDT call involved, so 
it's very unlikely to be a latency contributor. Timing it would add noise 
without much diagnostic value, so leaving it untimed seems intentional and 
reasonable.



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -476,6 +491,12 @@ private List<StoragePathInfo> 
listPartitionPathFiles(List<PartitionPath> partiti
     Set<StoragePath> missingPartitionPaths =
         CollectionUtils.diffSet(partitionPaths, cachedPartitionPaths.keySet());
 
+    if (missingPartitionPaths.isEmpty()) {
+      return cachedPartitionPaths.values().stream()
+          .flatMap(Collection::stream)
+          .collect(Collectors.toList());
+    }
+

Review Comment:
   🤖 Agreed that the PR description needs updating. On the equivalence 
question: when all paths are cached the original code would have called 
`getAllFilesInPartitions` with an empty keyset and then merged the (empty) 
result into `cachedPartitionPaths` values — so the returned list is identical. 
The early return is safe, but it does skip that empty-set call entirely, which 
could matter if any `tableMetadata` implementation has side effects on an empty 
invocation (metrics recording, etc.). Worth a note in the description that this 
is an intentional functional optimization, not just observability.



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -357,18 +358,26 @@ private Map<PartitionPath, List<FileSlice>> 
filterFiles(List<PartitionPath> part
                       .orElseGet(() -> 
finalFileSystemView.getLatestFileSlices(partitionPath.path))
                       .collect(Collectors.toList())
           ));
+    } finally {
+      log.debug("On {} with query instant as {}, it took {}ms to filter {} 
files into file slices across {} partitions",
+          metaClient.getTableConfig().getTableName(), 
queryInstant.orElse("N/A"), timer.endTimer(),
+          allFiles.size(), partitions.size());
     }
   }
 
   protected List<PartitionPath> listPartitionPaths(List<String> 
relativePartitionPaths,
                                                    Types.RecordType 
partitionFields,
                                                    Expression 
partitionColumnPredicates) {
     List<String> matchedPartitionPaths;
+    HoodieTimer timer = HoodieTimer.start();
     try {
       matchedPartitionPaths = 
tableMetadata.getPartitionPathWithPathPrefixUsingFilterExpression(relativePartitionPaths,
           partitionFields, partitionColumnPredicates);
     } catch (IOException e) {
       throw new HoodieIOException("Error fetching partition paths", e);
+    } finally {
+      log.debug("On {}, it took {} ms to list partition paths with {} 
relativePartitionPaths and partition predicates",

Review Comment:
   🤖 Worth noting: logging an `Expression` object (via its `toString()`) on a 
hot path can be non-trivial depending on the expression tree depth. If the 
intent is just to document that predicates were involved (rather than surfacing 
the actual predicate value), the current text-only approach is reasonable — but 
the message as written implies the predicate will be printed, which could 
mislead someone reading the logs. Either dropping "and partition predicates" 
from the message text, or adding `partitionColumnPredicates` as a format arg 
(with awareness of the toString cost), would make it unambiguous.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] chore: add timing logs for file index partition and file listing [hudi]

Reply via email to