justinmclean opened a new issue, #10623:
URL: https://github.com/apache/gravitino/issues/10623

   ### What would you like to be improved?
   
   JobManager.cleanUpStagingDirs() deletes the persisted job record before 
deleting that job’s staging directory. If FileUtils.deleteDirectory(...) then 
fails with an IOException, the code only logs the error and continues. Because 
later cleanup runs discover candidates from entityStore.list(...) rather than 
by scanning the filesystem, that staging directory is no longer associated with 
any stored job and may be left behind indefinitely.
   
   This code path is production-reachable because cleanUpStagingDirs() is 
scheduled automatically by JobManager during normal runtime.
   
   ### How should we improve?
   
   Reverse the cleanup order so the staging directory is deleted first, and 
only delete the job entity after filesystem cleanup succeeds. That keeps failed 
deletions retryable on the next scheduled cleanup run.
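
The proposed ordering can be sketched as follows. This is a minimal illustration, not Gravitino's actual code: `JobStore`, `listFinishedJobIds`, `deleteJob`, and the one-directory-per-job-ID layout under `stagingRoot` are all hypothetical stand-ins for the real entity store and staging layout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StagingCleanupSketch {

  /** Hypothetical stand-in for the entity store; not Gravitino's EntityStore API. */
  public interface JobStore {
    List<String> listFinishedJobIds();

    void deleteJob(String jobId) throws IOException;
  }

  /** Deletes a directory tree, children before parents. A missing directory is not an error. */
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order puts deeper paths first, so each directory is empty when deleted.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  /** Returns the number of jobs whose directory and record were both removed. */
  public static int cleanUp(JobStore store, Path stagingRoot) {
    int cleaned = 0;
    for (String jobId : store.listFinishedJobIds()) {
      try {
        // Step 1: filesystem cleanup. If this throws, the record is left untouched.
        deleteRecursively(stagingRoot.resolve(jobId));
        // Step 2: only now drop the record that makes the job discoverable.
        store.deleteJob(jobId);
        cleaned++;
      } catch (IOException | UncheckedIOException e) {
        // The record still exists, so the next scheduled run will retry this job.
        System.err.println("Cleanup failed for job " + jobId + ", will retry: " + e);
      }
    }
    return cleaned;
  }
}
```

With this order, a failed directory deletion leaves the job record in place, so the job is still listed on the next run. The remaining failure mode (directory deleted, record deletion fails) is harmless in this sketch because a missing directory is treated as already cleaned.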
   
   If deleting the entity first is required for another reason, an alternative 
is to add a fallback cleanup path that scans the staging directory tree and 
removes orphaned job directories even when the corresponding job entity is 
already gone.
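
A fallback scan along these lines might look like the sketch below. It assumes, hypothetically, that each job's staging directory is a direct child of the staging root named after the job ID; the class name, method name, and `knownJobIds` parameter are illustrative and not taken from Gravitino.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OrphanScanSketch {

  /** Deletes a directory tree, children before parents. */
  static void deleteRecursively(Path dir) throws IOException {
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  /**
   * Removes child directories of stagingRoot whose names are not in knownJobIds.
   * Returns the names of the directories that were removed.
   */
  public static List<String> removeOrphans(Path stagingRoot, Set<String> knownJobIds)
      throws IOException {
    List<String> removed = new ArrayList<>();
    try (DirectoryStream<Path> children =
        Files.newDirectoryStream(stagingRoot, p -> Files.isDirectory(p))) {
      for (Path child : children) {
        String name = child.getFileName().toString();
        if (knownJobIds.contains(name)) {
          continue; // still backed by a stored job; the regular cleanup owns it
        }
        try {
          deleteRecursively(child);
          removed.add(name);
        } catch (IOException | UncheckedIOException e) {
          // The directory stays on disk, so a later scan will find it again.
          System.err.println("Could not remove orphan " + child + ": " + e);
        }
      }
    }
    return removed;
  }
}
```

A real implementation would probably also need a guard, such as a minimum directory age, so that a directory created for a job whose entity has not been stored yet is not mistaken for an orphan.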

