yihua commented on code in PR #18417:
URL: https://github.com/apache/hudi/pull/18417#discussion_r3055149157
##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -385,7 +394,13 @@ protected List<PartitionPath>
listPartitionPaths(List<String> relativePartitionP
HoodieTimeline timelineToQuery = findInstantsInRange();
matchedPartitionPaths =
TimelineUtils.getWrittenPartitions(timelineToQuery);
} else {
Review Comment:
🤖 Good observation. `TimelineUtils.getWrittenPartitions` just scans the
already-loaded in-memory timeline — there's no I/O or MDT call involved, so
it's very unlikely to be a latency contributor. Timing it would add noise
without much diagnostic value, so leaving it untimed seems intentional and
reasonable.
##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -476,6 +491,12 @@ private List<StoragePathInfo>
listPartitionPathFiles(List<PartitionPath> partiti
Set<StoragePath> missingPartitionPaths =
CollectionUtils.diffSet(partitionPaths, cachedPartitionPaths.keySet());
+ if (missingPartitionPaths.isEmpty()) {
+ return cachedPartitionPaths.values().stream()
+ .flatMap(Collection::stream)
+ .collect(Collectors.toList());
+ }
+
Review Comment:
🤖 Agreed that the PR description needs updating. On the equivalence
question: when all paths are cached the original code would have called
`getAllFilesInPartitions` with an empty keyset and then merged the (empty)
result into `cachedPartitionPaths` values — so the returned list is identical.
The early return is safe, but it does skip that empty-set call entirely, which
could matter if any `tableMetadata` implementation has side effects on an empty
invocation (metrics recording, etc.). Worth a note in the description that this
is an intentional functional optimization, not just observability.
##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -357,18 +358,26 @@ private Map<PartitionPath, List<FileSlice>>
filterFiles(List<PartitionPath> part
.orElseGet(() ->
finalFileSystemView.getLatestFileSlices(partitionPath.path))
.collect(Collectors.toList())
));
+ } finally {
+ log.debug("On {} with query instant as {}, it took {}ms to filter {}
files into file slices across {} partitions",
+ metaClient.getTableConfig().getTableName(),
queryInstant.orElse("N/A"), timer.endTimer(),
+ allFiles.size(), partitions.size());
}
}
protected List<PartitionPath> listPartitionPaths(List<String>
relativePartitionPaths,
Types.RecordType
partitionFields,
Expression
partitionColumnPredicates) {
List<String> matchedPartitionPaths;
+ HoodieTimer timer = HoodieTimer.start();
try {
matchedPartitionPaths =
tableMetadata.getPartitionPathWithPathPrefixUsingFilterExpression(relativePartitionPaths,
partitionFields, partitionColumnPredicates);
} catch (IOException e) {
throw new HoodieIOException("Error fetching partition paths", e);
+ } finally {
+ log.debug("On {}, it took {} ms to list partition paths with {}
relativePartitionPaths and partition predicates",
Review Comment:
🤖 Worth noting: logging an `Expression` object (via its `toString()`) on a
hot path can be non-trivial depending on the expression tree depth. If the
intent is just to document that predicates were involved (rather than surfacing
the actual predicate value), the current text-only approach is reasonable — but
the message as written implies the predicate will be printed, which could
mislead someone reading the logs. Either dropping "and partition predicates"
from the message text, or adding `partitionColumnPredicates` as a format arg
(with awareness of the toString cost), would make it unambiguous.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]