Hello all,

I would like to share a *Job Poller* related real-world use case we are
facing with an OFBiz based application running multiple nodes with
auto-scaling enabled.

*Problem scenario:*
For an OFBiz environment running with multiple auto-scaling nodes, an
individual node may be replaced or restarted due to infrastructure failure,
an auto-scaling relaunch, unhealthy instance recovery, etc.

In such a case, the jobs assigned to/running on that specific node
(JobSandbox.*runByInstanceId*, which helps the Job Poller ensure/manage job
assignments) get stuck and are not picked up by any other node unless the
runByInstanceId is cleared manually.

This becomes a *significant* problem for systems dealing with high traffic
that rely heavily on the auto-scaling relaunch mechanism.

*Solution proposal:*
To overcome this problem, I first worked toward creating a separate async
server for job management that would sit outside the ASG environment.
However, this is not a good approach, as it adds extra infrastructure
*cost* and *operational* overhead for the client to manage those additional
servers.

As a *next* step, I think the out-of-the-box Job Poller should *itself* be
able to validate and handle such stale jobs and re-assign them to other
active nodes for further processing. For this, I propose a *Lease +
Heartbeat based job ownership* approach. This validation method includes 3
steps:

*#1 Assigning the node as the Job owner*
-- Assign the individual node identifier (instance-id) as the owner of all
jobs it is running (*runByInstanceId*), along with a new custom field
(*JobSandbox.leaseUpdatedStamp*) that helps the Job Poller track the last
time the lease was updated by the node, confirming the node was still
active at that time.

*#2 Heartbeat / Lease Renewal*
-- At a configured interval, the Job Poller running on each node will
update the lease timestamp for the open/in-progress jobs that the node
currently owns.

*#3 Lease Expiry Validation*
-- The Job Poller running on each active node will also periodically check
whether every owned job, including jobs owned by other nodes, has had its
heartbeat updated within the specified threshold. Any job whose heartbeat
has not been updated within that threshold is considered owned by a stale
node and becomes eligible for recovery: the Job Poller will release such
stale jobs, making them available for other active nodes to pick up.
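To make the three steps concrete, here is a minimal, self-contained Java
sketch of the lease lifecycle. Note that JobLeaseManager, JobRecord, and the
method names are illustrative assumptions, not existing OFBiz APIs; a real
implementation would operate on the JobSandbox entity (runByInstanceId plus
the proposed leaseUpdatedStamp field) through the delegator.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: this logic would really live in the Job Poller
// and work against JobSandbox rows, not an in-memory map.
public class JobLeaseManager {

    // Stand-in for a JobSandbox row.
    static class JobRecord {
        String jobId;
        String runByInstanceId;    // owning node, null when unassigned
        Instant leaseUpdatedStamp; // proposed heartbeat field
        JobRecord(String jobId) { this.jobId = jobId; }
    }

    // Proposed lease expiry threshold (10 minutes).
    static final Duration LEASE_EXPIRY_THRESHOLD = Duration.ofMinutes(10);

    final Map<String, JobRecord> jobs = new HashMap<>();

    // Step #1: assign the node as the job owner and start the lease.
    public void claimJob(String jobId, String instanceId, Instant now) {
        JobRecord job = jobs.computeIfAbsent(jobId, JobRecord::new);
        job.runByInstanceId = instanceId;
        job.leaseUpdatedStamp = now;
    }

    // Step #2: heartbeat - refresh the lease for every job this node owns.
    public void renewLeases(String instanceId, Instant now) {
        for (JobRecord job : jobs.values()) {
            if (instanceId.equals(job.runByInstanceId)) {
                job.leaseUpdatedStamp = now;
            }
        }
    }

    // Step #3: release jobs whose lease has expired so any active node can
    // pick them up again; returns the ids of the released jobs.
    public List<String> releaseExpiredLeases(Instant now) {
        List<String> released = new ArrayList<>();
        for (JobRecord job : jobs.values()) {
            boolean expired = job.runByInstanceId != null
                    && job.leaseUpdatedStamp != null
                    && Duration.between(job.leaseUpdatedStamp, now)
                               .compareTo(LEASE_EXPIRY_THRESHOLD) > 0;
            if (expired) {
                job.runByInstanceId = null;   // clear stale ownership
                job.leaseUpdatedStamp = null;
                released.add(job.jobId);
            }
        }
        return released;
    }
}
```

With this model, a node that heartbeats every 5 minutes never crosses the
10-minute threshold, while a node that dies stops renewing and its jobs are
released on the next expiry-validation pass.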

*Proposed time frequencies/intervals:*
-- *Lease update interval*: every 5 minutes
-- *Lease expiry threshold*: 10 minutes
-- *Lease expiry validation*: every 8 minutes

*Points to consider*
-- Each node should have a unique node identifier (runByInstanceId) that
helps track/validate the aliveness of each individual node.
-- The time intervals suggested above could also be made configurable via
data.
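If the intervals are made data-configurable, seeding them as SystemProperty
records could look like the sketch below. The systemResourceId and
systemPropertyId values here are purely illustrative assumptions, not
existing OFBiz properties:

```xml
<entity-engine-xml>
    <SystemProperty systemResourceId="service"
        systemPropertyId="jobLeaseUpdateIntervalMinutes"
        systemPropertyValue="5"
        description="Illustrative: how often each node renews its job leases"/>
    <SystemProperty systemResourceId="service"
        systemPropertyId="jobLeaseExpiryThresholdMinutes"
        systemPropertyValue="10"
        description="Illustrative: lease age after which a job is considered stale"/>
    <SystemProperty systemResourceId="service"
        systemPropertyId="jobLeaseExpiryValidationMinutes"
        systemPropertyValue="8"
        description="Illustrative: how often stale-lease validation runs"/>
</entity-engine-xml>
```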

Looking forward to your valuable thoughts on this. I'll create a Jira
ticket for it and update the details there according to the inputs.

Thanks & Regards,
Ankit Joshi
