Hello all,

I would like to share a *Job Poller* related real-world use case we are facing with an OFBiz-based application running multiple nodes with auto-scaling enabled.
*Problem scenario:*
In an OFBiz environment running with multiple auto-scaling nodes, an individual node may be replaced or restarted due to infrastructure failure, an auto-scaling relaunch, unhealthy-instance recovery, etc. In such cases, the jobs assigned to or running on that specific node (tracked via JobSandbox.*runByInstanceId*, which the Job Poller uses to manage job assignments) get stuck and are not picked up by any other node until runByInstanceId is cleared manually. This becomes a *significant* problem for systems dealing with high traffic that rely heavily on the auto-scaling relaunch mechanism.

*Solution proposal:*
To overcome this problem, I first worked toward creating a separate async server for job management that would sit outside the ASG environment. However, this is not a good approach, as it adds infrastructure *cost* and *operational* overhead for the client, who has to manage those additional servers.

As a *next* step, I think the out-of-the-box Job Poller should *itself* be able to detect such stale jobs and re-assign them to another active node for further processing. For this, I propose a *lease + heartbeat based job ownership* approach. It involves three steps:

*#1 Assigning the node as the job owner* -- Assign the individual node identifier (instance-id) as the owner of every job it is running (*runByInstanceId*), along with a new custom field (*JobSandbox.leaseUpdatedStamp*) that lets the Job Poller track the last time the lease was updated by the node, confirming the node was still active at that time.

*#2 Heartbeat / lease renewal* -- At a configured interval, the Job Poller running on each node updates the lease timestamp for the open/in-progress jobs that node currently owns.
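To make the idea concrete, here is a minimal in-memory sketch of the lease + heartbeat ownership mechanism. The field names (runByInstanceId, leaseUpdatedStamp) mirror the proposal, but the Job class and method names are purely illustrative assumptions, not actual OFBiz/JobSandbox code -- the real implementation would run these steps through the entity engine against the JobSandbox entity.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Illustrative sketch only: a real implementation would operate on the
// JobSandbox entity via the delegator, not on an in-memory list.
public class JobLeaseSketch {

    static class Job {
        final String jobId;
        String runByInstanceId;      // owning node; null when unassigned
        Instant leaseUpdatedStamp;   // last heartbeat from the owner

        Job(String jobId) { this.jobId = jobId; }
    }

    // Proposed threshold from the mail: 10 minutes.
    static final Duration LEASE_EXPIRY_THRESHOLD = Duration.ofMinutes(10);

    // Step #1: claim ownership and start the lease.
    static void claim(Job job, String instanceId, Instant now) {
        job.runByInstanceId = instanceId;
        job.leaseUpdatedStamp = now;
    }

    // Step #2: heartbeat -- the owning node renews the lease on its own
    // jobs at a configured interval (proposed: every 5 minutes).
    static void renewLeases(Collection<Job> jobs, String instanceId, Instant now) {
        for (Job j : jobs) {
            if (instanceId.equals(j.runByInstanceId)) {
                j.leaseUpdatedStamp = now;
            }
        }
    }

    // Step #3: expiry validation (proposed: every 8 minutes) -- release
    // jobs whose lease is older than the threshold so any active node
    // can pick them up again.
    static List<String> releaseStaleJobs(Collection<Job> jobs, Instant now) {
        List<String> released = new ArrayList<>();
        for (Job j : jobs) {
            if (j.runByInstanceId != null
                    && j.leaseUpdatedStamp.plus(LEASE_EXPIRY_THRESHOLD).isBefore(now)) {
                j.runByInstanceId = null;   // clear stale ownership
                released.add(j.jobId);
            }
        }
        return released;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        Job a = new Job("JOB_A");
        Job b = new Job("JOB_B");
        claim(a, "node-1", t0);
        claim(b, "node-2", t0);

        // node-2 keeps heartbeating; node-1 dies right after t0.
        renewLeases(List.of(a, b), "node-2", t0.plus(Duration.ofMinutes(5)));

        // 12 minutes in, only node-1's job has a lease older than 10 minutes.
        List<String> released = releaseStaleJobs(List.of(a, b), t0.plus(Duration.ofMinutes(12)));
        System.out.println(released);   // [JOB_A]
    }
}
```

Note the design choice in step #3: the stale job is only *released* (ownership cleared), not directly re-assigned, so the normal Job Poller pickup path on any healthy node handles the recovery.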
*#3 Lease expiry validation* -- The Job Poller running on each active node also periodically validates whether jobs are renewing their leases within the specified threshold. Any job that fails to update its heartbeat within that threshold is considered owned by a stale node and becomes eligible for recovery: the Job Poller releases such stale jobs, making them available for other active nodes to pick up.

*Proposed frequencies/intervals:*
-- *Lease update interval:* every 5 minutes
-- *Lease expiry threshold:* 10 minutes
-- *Lease expiry validation:* every 8 minutes

*Points to consider*
-- Each node should have a unique node identifier (runByInstanceId) to help track/validate the aliveness of each individual node.
-- The time intervals suggested above could also be made configurable via data.

Looking forward to your valuable thoughts on this. I'll create a Jira ticket for it and will update the details there according to the inputs.

Thanks & Regards,
Ankit Joshi
