codope commented on code in PR #18405:
URL: https://github.com/apache/hudi/pull/18405#discussion_r3034934604


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##########
@@ -872,6 +873,10 @@ private Pair<Option<String>, JavaRDD<WriteStatus>> writeToSinkAndDoMetaSync(Hood
           totalSuccessfulRecords);
       String commitActionType = CommitUtils.getCommitActionType(cfg.operation, HoodieTableType.valueOf(cfg.tableType));
 
+      // Run pre-commit streaming offset validators (if configured) before commit
+      SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatusRDD,
+          checkpointCommitMetadata, metaClient);

Review Comment:
   > offset validation is a stronger guard than commitOnErrors
   
   I agree technically. However, I am looking at this from the user's point of
   view. Consider:
   
     - Kafka offset diff = 1000
     - 800 records written successfully, 200 write tasks failed
     - User has commitOnErrors=true (they explicitly accept partial failures)
     - The validator sees a 20% deviation and throws HoodieValidationException ("data loss detected")
     - The real issue is write failures, not data loss, so the error message is misleading
     - The user's explicit `commitOnErrors=true` policy is silently overridden
   
   The validator conflates two distinct failure modes: (a) records silently 
dropped by a bug (real data loss) vs (b) records that failed to write (tracked 
write errors). The existing `commitOnErrors` logic is designed for case (b). 
The validator should either:
     - Filter out write-error statuses before counting records, or
     - At a minimum, log the error count alongside the deviation so users can distinguish the two cases.

   What do you think?
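
To make the first option concrete, here is a rough sketch of counting tracked write errors separately from the offset gap. It is not the PR's code: `OffsetValidationSketch`, `TaskStatus`, `Result`, and `check` are hypothetical names, and `TaskStatus` is a stand-in with two counters rather than Hudi's `WriteStatus`. It assumes each status reports the records it attempted and how many of those failed.

```java
import java.util.List;

/** Sketch only: every type here is a hypothetical stand-in, not a Hudi class. */
final class OffsetValidationSketch {

    /** Hypothetical per-task counters: records attempted and records that failed to write. */
    static final class TaskStatus {
        final long totalRecords;
        final long errorRecords;

        TaskStatus(long totalRecords, long errorRecords) {
            this.totalRecords = totalRecords;
            this.errorRecords = errorRecords;
        }
    }

    /** Outcome of the check, with tracked errors kept apart from the unexplained gap. */
    static final class Result {
        final long written;
        final long writeErrors;
        final long unaccounted;
        final boolean dataLossSuspected;

        Result(long written, long writeErrors, long unaccounted, boolean dataLossSuspected) {
            this.written = written;
            this.writeErrors = writeErrors;
            this.unaccounted = unaccounted;
            this.dataLossSuspected = dataLossSuspected;
        }

        /** Message suitable for a log line or an exception, showing both numbers. */
        String describe() {
            return "written=" + written + ", writeErrors=" + writeErrors
                + ", unaccounted=" + unaccounted;
        }
    }

    /**
     * Compares the source offset diff with the write statuses.
     * Only records that are neither written nor tracked as write errors
     * count toward the deviation that signals possible data loss.
     */
    static Result check(long offsetDiff, List<TaskStatus> statuses, double tolerancePct) {
        long written = 0;
        long writeErrors = 0;
        for (TaskStatus s : statuses) {
            writeErrors += s.errorRecords;
            written += s.totalRecords - s.errorRecords;
        }
        // Case (a): records the source advanced past that nothing accounts for.
        long unaccounted = Math.max(0, offsetDiff - written - writeErrors);
        double deviationPct = offsetDiff == 0 ? 0.0 : 100.0 * unaccounted / offsetDiff;
        return new Result(written, writeErrors, unaccounted, deviationPct > tolerancePct);
    }
}
```

With the numbers from the scenario above (offset diff 1000, 800 written, 200 tracked errors), `unaccounted` is 0, so the validator would not report data loss and the decision stays with `commitOnErrors`. If 200 records were missing with no tracked errors, `unaccounted` would be 200 and the check would still fire.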



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.