justinmclean opened a new issue, #10623: URL: https://github.com/apache/gravitino/issues/10623
### What would you like to be improved?

JobManager.cleanUpStagingDirs() deletes the persisted job record before deleting that job’s staging directory. If FileUtils.deleteDirectory(...) then fails with an IOException, the code only logs the error and continues. Because later cleanup runs discover candidates from entityStore.list(...) rather than by scanning the filesystem, that staging directory is no longer associated with any stored job and may be left behind indefinitely.

This code path is production-reachable because cleanUpStagingDirs() is scheduled automatically by JobManager during normal runtime.

### How should we improve?

Reverse the cleanup order so the staging directory is deleted first, and only delete the job entity after filesystem cleanup succeeds. That keeps failed deletions retryable on the next scheduled cleanup run.

If deleting the entity first is required for another reason, an alternative is to add a fallback cleanup path that scans the staging directory tree and removes orphaned job directories even when the corresponding job entity is already gone.
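A rough sketch of both options, to make the proposed ordering concrete. This is not the actual JobManager code: `JobStore`, `InMemoryJobStore`, `cleanUp`, `removeOrphans`, and the `<stagingRoot>/<jobId>` directory layout are all hypothetical names and assumptions for illustration, and plain `java.nio.file` is used in place of FileUtils.deleteDirectory(...) so the example stands alone.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Sketch only: every type and method name here is hypothetical, not Gravitino's API. */
class StagingCleanupSketch {

  /** Hypothetical stand-in for the persisted job records. */
  interface JobStore {
    List<String> listJobIds();

    void deleteJob(String jobId);
  }

  /** Trivial in-memory store, used only to exercise the sketch. */
  static final class InMemoryJobStore implements JobStore {
    final Set<String> ids = new HashSet<>();

    public List<String> listJobIds() {
      return new ArrayList<>(ids);
    }

    public void deleteJob(String jobId) {
      ids.remove(jobId);
    }
  }

  /** Recursively deletes a directory; a missing directory counts as already cleaned. */
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order so children are deleted before their parents.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  /**
   * Proposed order: delete the staging directory first and drop the record only
   * if that succeeded. Returns the job ids left in the store for a later retry.
   * A real implementation would first filter to finished, expired jobs.
   */
  static List<String> cleanUp(JobStore store, Path stagingRoot) {
    List<String> retryLater = new ArrayList<>();
    for (String jobId : store.listJobIds()) {
      try {
        deleteRecursively(stagingRoot.resolve(jobId)); // assumed layout: <root>/<jobId>
      } catch (IOException e) {
        retryLater.add(jobId); // record is kept, so the next run sees this job again
        continue;
      }
      store.deleteJob(jobId);
    }
    return retryLater;
  }

  /** Fallback: remove directories under the root that no stored job refers to. */
  static List<String> removeOrphans(JobStore store, Path stagingRoot) throws IOException {
    Set<String> known = new HashSet<>(store.listJobIds());
    List<Path> children;
    try (Stream<Path> list = Files.list(stagingRoot)) {
      children = list.filter(Files::isDirectory).collect(Collectors.toList());
    }
    List<String> removed = new ArrayList<>();
    for (Path child : children) {
      String name = child.getFileName().toString();
      if (!known.contains(name)) {
        deleteRecursively(child);
        removed.add(name);
      }
    }
    return removed;
  }
}
```

With the first approach a failed deletion leaves the record in place, so the next scheduled run picks the job up again without any filesystem scan. The orphan scan would presumably need an extra guard (for example a minimum directory age) if a staging directory can exist before its job record is stored; that is an assumption about the job lifecycle, not something verified against the code.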
