yihua commented on code in PR #18476:
URL: https://github.com/apache/hudi/pull/18476#discussion_r3043520609
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -427,7 +426,6 @@ private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
HashMap<String, String> params = new HashMap<>();
Review Comment:
🤖 Could you confirm that `createRelation` with explicit
`hoodie.datasource.read.paths` + `glob.paths` truly suppresses Hudi's
file-group-based log file auto-discovery for MoR? My concern is that if the
relation still scans for *all* log files belonging to a file group (rather than
only the ones in `paths`), a concurrent completed commit that wrote new log
files to the same file group between clustering scheduling and execution could
get silently included — producing a clustered output that covers a wider time
range than the plan intended.
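
To make the concern concrete, here is a minimal Java sketch of the containment check I have in mind. Everything in it is hypothetical: the class `PlannedLogFileCheck`, the method `unplannedLogFiles`, and the file names are illustrative only and are not Hudi APIs or real Hudi log file names. It only compares the paths listed in the clustering plan against the paths a read actually resolved, and says nothing about how `createRelation` behaves.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Hudi: models the check described above.
public class PlannedLogFileCheck {

    /**
     * Returns every path the read resolved that the clustering plan did not list.
     * An empty result means the read stayed within the planned file slices.
     */
    static Set<String> unplannedLogFiles(Collection<String> plannedPaths,
                                         Collection<String> resolvedPaths) {
        Set<String> extra = new TreeSet<>(resolvedPaths); // sorted for stable reporting
        extra.removeAll(new HashSet<>(plannedPaths));
        return extra;
    }

    public static void main(String[] args) {
        // Illustrative file names only (assumption): one base file and two log
        // files were present when clustering was scheduled.
        List<String> planned = Arrays.asList(
                "fg1_base.parquet", "fg1.log.1", "fg1.log.2");

        // A concurrent commit added fg1.log.3 before clustering executed, and a
        // file-group-wide scan picked it up.
        List<String> resolved = Arrays.asList(
                "fg1_base.parquet", "fg1.log.1", "fg1.log.2", "fg1.log.3");

        Set<String> extra = unplannedLogFiles(planned, resolved);
        if (!extra.isEmpty()) {
            System.out.println("Read included files outside the plan: " + extra);
        }
    }
}
```

If the relation reads only the explicit paths, a check like this would always come back empty. If it rescans the file group, the extra log file would show up here.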