arjnklc opened a new pull request, #10814:
URL: https://github.com/apache/gravitino/pull/10814

   ### What changes were proposed in this pull request?
   
   Added compensation logic in `JobManager.runJob` to cancel submitted jobs 
when entity persistence fails:
   - When `entityStore.put()` throws `IOException` after a successful 
`jobExecutor.submitJob()`, the method now calls 
`jobExecutor.cancelJob(jobExecutionId)` to roll back the orphaned execution.
   - Logs both successful and failed rollback attempts.
   - Best-effort cleanup of the staging directory via 
`FileUtils.deleteDirectory()`.
   - The original `RuntimeException` is still re-thrown, preserving existing 
failure semantics.
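
   The compensation flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the nested `JobExecutor` and `EntityStore` interfaces, the `String`-based signatures, and the in-memory `LOG` list are hypothetical stand-ins for the real Gravitino types and logger, and the staging-directory cleanup is omitted to keep the sketch dependency-free.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class RunJobRollbackSketch {
  // Hypothetical, simplified interfaces; the real Gravitino types differ.
  interface JobExecutor {
    String submitJob(String jobTemplate);
    void cancelJob(String jobExecutionId);
  }

  interface EntityStore {
    void put(String jobExecutionId) throws IOException;
  }

  // Stand-in for a logger so the sketch has no external dependencies.
  static final List<String> LOG = new ArrayList<>();

  static String runJob(JobExecutor executor, EntityStore store, String jobTemplate) {
    // Step 1: the job is submitted before anything is persisted.
    String jobExecutionId = executor.submitJob(jobTemplate);
    try {
      // Step 2: persist the entity so the execution can be tracked.
      store.put(jobExecutionId);
      return jobExecutionId;
    } catch (IOException e) {
      // Compensation: cancel the already-submitted job (best effort).
      try {
        executor.cancelJob(jobExecutionId);
        LOG.add("Rolled back job execution " + jobExecutionId);
      } catch (RuntimeException cancelFailure) {
        LOG.add("Failed to roll back job execution " + jobExecutionId);
        e.addSuppressed(cancelFailure);
      }
      // The caller still sees a RuntimeException, as before the change.
      throw new RuntimeException("Failed to persist job " + jobExecutionId, e);
    }
  }

  public static void main(String[] args) {
    JobExecutor executor = new JobExecutor() {
      public String submitJob(String jobTemplate) { return "exec-1"; }
      public void cancelJob(String jobExecutionId) { /* no-op in this demo */ }
    };
    EntityStore failingStore = id -> { throw new IOException("simulated store failure"); };
    try {
      runJob(executor, failingStore, "demo-template");
    } catch (RuntimeException expected) {
      System.out.println(LOG);
    }
  }
}
```

   Catching a failure of the cancel call itself keeps the rollback best-effort: a failed cancellation is logged and attached as a suppressed exception, and the persistence failure is still what the caller sees.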
   
   ### Why are the changes needed?
   
   `JobManager.runJob` submits jobs to `jobExecutor` before persisting 
`JobEntity` in `entityStore`. If the persistence operation throws 
`IOException`, the submitted job continues running as an orphaned execution 
that cannot be tracked or managed. This change makes submit+persist effectively 
atomic from the caller's perspective.
   
   Fix: #10271
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is an internal robustness improvement. The API behavior remains the 
same: the caller still receives a `RuntimeException` on persistence failure.
   
   ### How was this patch tested?
   
   Updated the existing unit test `TestJobManager.testRunJob` to verify 
rollback behavior:
   ```
   ./gradlew :core:test --tests "org.apache.gravitino.job.TestJobManager" -PskipITs
   ```
   The test verifies that when `entityStore.put()` fails, 
`jobExecutor.cancelJob()` is invoked exactly once with the correct execution ID.
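
   The shape of that check might look like the sketch below. It uses hand-written fakes rather than a mocking library so that it stays self-contained; the real `TestJobManager` may be structured quite differently. The `Executor` and `Store` interfaces, the fixed ID `job-42`, and the compact `runJob` stand-in are all hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class RollbackTestSketch {
  // Hypothetical, simplified collaborators for illustration only.
  interface Executor {
    String submit();
    void cancel(String jobExecutionId);
  }

  interface Store {
    void put(String jobExecutionId) throws IOException;
  }

  // Fake executor that returns a fixed ID and records every cancel call.
  static final class RecordingExecutor implements Executor {
    final List<String> cancelled = new ArrayList<>();
    public String submit() { return "job-42"; }
    public void cancel(String jobExecutionId) { cancelled.add(jobExecutionId); }
  }

  // Compact stand-in for the method under test: submit, persist, cancel on failure.
  static String runJob(Executor executor, Store store) {
    String id = executor.submit();
    try {
      store.put(id);
      return id;
    } catch (IOException e) {
      executor.cancel(id);
      throw new RuntimeException("Failed to persist job " + id, e);
    }
  }

  // Runs the failure scenario and returns the IDs that were cancelled.
  static List<String> cancelledAfterFailedPut() {
    RecordingExecutor executor = new RecordingExecutor();
    Store failingStore = id -> { throw new IOException("simulated store failure"); };
    try {
      runJob(executor, failingStore);
      throw new AssertionError("expected a RuntimeException");
    } catch (RuntimeException expected) {
      // The persistence failure must still reach the caller.
    }
    return executor.cancelled;
  }

  public static void main(String[] args) {
    List<String> cancelled = cancelledAfterFailedPut();
    // Exactly one cancel call, carrying the ID returned by submit().
    if (cancelled.size() != 1 || !cancelled.get(0).equals("job-42")) {
      throw new AssertionError("unexpected cancel calls: " + cancelled);
    }
    System.out.println("rollback verified");
  }
}
```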
   

