Hello Ankit,

We also ran into this issue and currently work around it using the Kubernetes
preStop hook: before a pod stops, it cleans up everything that is still
running, using a shell script and SQL.

It works for the most part, but occasionally an instance picks up a new job
right at the end of the pre-stop script.

We added a way for our instances to query which pod ids are currently
running, so the jobs left over by pods that no longer exist can be cleaned
up. Everything works now, but it is not very clean.

I'm pleased to read your ideas for solving this issue, and I think they are
heading in the right direction.

Nice one, thanks!

[...]

> As a *next* step, I think the out-of-the-box Job Poller should *itself* be
> able to validate and handle such stale jobs and re-assign them to other
> active nodes for further processing. For this, I propose that a *Lease +
> Heartbeat based job ownership* approach could be helpful here. This
> validation method includes 3 steps:
> 
> *#1 Assigning the node as the Job owner*
> -- Assign the individual node identifier (instance-id) as the owner of all
> jobs it is running (*runByInstanceId*), along with a new custom field
> (*JobSandbox.leaseUpdatedStamp*) that will help the Job Poller track the
> last time the lease was updated by the node, confirming the node was still
> active at that time.
> 
> *#2 Heartbeat / Lease Renewal*
> -- At a configured interval, the Job Poller running on each node will
> update the lease timestamp for the open/in-progress jobs that the node
> currently owns.
> 
> *#3 Lease Expiry Validation*
> -- The Job Poller running on each active node will also periodically
> validate whether all the jobs owned by that node itself are actively
> updating their heartbeat within the specified threshold. Any job that fails
> to update its heartbeat within the given threshold will be considered owned
> by a stale node and will be eligible for recovery. The Job Poller will
> release such identified stale jobs, making them available for other active
> nodes to pick up.
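If I read the three steps correctly, a minimal, framework-agnostic Java
sketch of the idea could look like this (LeaseManager, JobRecord and the
in-memory map are hypothetical stand-ins for illustration; a real
implementation would read and update JobSandbox rows via the entity engine):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LeaseManager {

    static class JobRecord {
        final String jobId;
        String runByInstanceId;      // step #1: node that owns the job
        Instant leaseUpdatedStamp;   // step #1: last heartbeat time
        JobRecord(String jobId, String owner, Instant lease) {
            this.jobId = jobId;
            this.runByInstanceId = owner;
            this.leaseUpdatedStamp = lease;
        }
    }

    private final Map<String, JobRecord> jobs = new ConcurrentHashMap<>();
    private final Duration expiryThreshold;

    LeaseManager(Duration expiryThreshold) {
        this.expiryThreshold = expiryThreshold;
    }

    // Step #1: record ownership when a node picks up a job.
    void claim(String jobId, String instanceId, Instant now) {
        jobs.put(jobId, new JobRecord(jobId, instanceId, now));
    }

    // Step #2: heartbeat - renew the lease on every job this node owns.
    void renewLeases(String instanceId, Instant now) {
        for (JobRecord job : jobs.values()) {
            if (instanceId.equals(job.runByInstanceId)) {
                job.leaseUpdatedStamp = now;
            }
        }
    }

    // Step #3: release jobs whose lease was not renewed within the
    // threshold, so other active nodes can pick them up again.
    int releaseStaleJobs(Instant now) {
        int released = 0;
        for (JobRecord job : jobs.values()) {
            if (job.runByInstanceId != null
                    && job.leaseUpdatedStamp.plus(expiryThreshold).isBefore(now)) {
                job.runByInstanceId = null;   // back in the pending pool
                released++;
            }
        }
        return released;
    }

    String ownerOf(String jobId) {
        return jobs.get(jobId).runByInstanceId;
    }

    public static void main(String[] args) {
        LeaseManager mgr = new LeaseManager(Duration.ofMinutes(10));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");

        mgr.claim("job-1", "node-a", t0);
        mgr.claim("job-2", "node-b", t0);

        // node-a heartbeats 5 minutes later; node-b has crashed silently.
        mgr.renewLeases("node-a", t0.plus(Duration.ofMinutes(5)));

        // Expiry validation at t0 + 12 min: only node-b's job is stale.
        System.out.println(mgr.releaseStaleJobs(t0.plus(Duration.ofMinutes(12)))); // 1
        System.out.println(mgr.ownerOf("job-1"));  // node-a
        System.out.println(mgr.ownerOf("job-2"));  // null
    }
}
```

With the intervals proposed below (renew every 5 minutes, 10-minute
threshold), a node can miss one heartbeat without its jobs being reclaimed.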
> 
> *Proposed time frequencies/intervals:*
> -- *Lease update interval*: every 5 minutes
> -- *Lease expiry threshold*: 10 minutes
> -- *Lease expiry validation*: every 8 minutes
> 
> *Points to consider*
> -- Each node should have a unique node identifier (runByInstanceId) that
> will help track and validate the liveness of each individual node.
> -- The time intervals suggested above could also be made configurable via
> data.
> 
> Looking forward to your valuable thoughts on this. I'll create a Jira
> ticket for it and update the details there based on the inputs.
> 
> Thanks & Regards,
> Ankit Joshi
