repeat after me:
a load-balancing fair-share message routing system is NOT a job scheduler.
it can do a closely related thing, but it is not one.
there are several ways to do this; i'm sure the guide covers a couple.
i would normally handle this one of two ways:
a) if tasks are expensive, then don't push tasks around. have workers ask for
a task (e.g. using REQ/REP) one at a time.
b) if tasks are inexpensive (and efficiency matters), then shovel requests at
the workers via PUSH (the workers PULL) using a modest HWM. if one worker
gets 1000 tasks and another gets 10 (because there were only 1010), who
cares? the tasks are cheap.
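option (a) above can be sketched roughly as follows. this is a minimal
single-process illustration using pyzmq with the inproc transport and a
worker thread; socket names and task payloads are made up for the example,
and all the error handling a real deployment needs is omitted.

```python
# Option (a) sketch: workers ask for tasks one at a time over REQ/REP,
# so no task ever queues up behind a dead worker.
import threading
import zmq

ctx = zmq.Context()

# Bind the scheduler end first: inproc requires bind before connect.
scheduler = ctx.socket(zmq.REP)
scheduler.bind("inproc://tasks")

results = []

def worker(n_tasks):
    sock = ctx.socket(zmq.REQ)
    sock.connect("inproc://tasks")
    for _ in range(n_tasks):
        sock.send(b"ready")        # ask the scheduler for one task
        task = sock.recv()         # receive exactly one task
        results.append(task)       # (a real worker would process it here)
    sock.close()

tasks = [b"task-%d" % i for i in range(5)]
t = threading.Thread(target=worker, args=(len(tasks),))
t.start()

for task in tasks:
    scheduler.recv()               # wait until some worker asks
    scheduler.send(task)           # hand that worker exactly one task

t.join()
scheduler.close()
ctx.term()
```

because a worker only ever holds the one task it asked for, unplugging its
cable strands at most that single task, which the timeout below recovers.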
in each case, when a task is sent, it gets a timestamp and a timeout.
workers PUSH back an acknowledgement when they complete a task, and the
scheduler process marks it as done. when a task times out, you schedule it
again.
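the bookkeeping described above can be sketched without any ZMQ plumbing at
all (the PUSH of the task and the PULL of the ack are elided); the class
and method names here are hypothetical, not from anything in the guide.

```python
# Sketch of the scheduler's timestamp/timeout/ack bookkeeping.
import time

class Scheduler:
    def __init__(self, timeout):
        self.timeout = timeout
        self.pending = []        # tasks not yet dispatched (or requeued)
        self.in_flight = {}      # task_id -> dispatch timestamp
        self.done = set()

    def submit(self, task_id):
        self.pending.append(task_id)

    def dispatch(self, now=None):
        """Hand out one task, stamping it with the dispatch time."""
        now = time.time() if now is None else now
        task_id = self.pending.pop(0)
        self.in_flight[task_id] = now
        return task_id           # in real code: PUSH the task to a worker

    def acknowledge(self, task_id):
        """Called when a worker PUSHes back an ack for a finished task."""
        if self.in_flight.pop(task_id, None) is not None:
            self.done.add(task_id)

    def reap(self, now=None):
        """Requeue every task whose ack did not arrive within the timeout."""
        now = time.time() if now is None else now
        for task_id, sent_at in list(self.in_flight.items()):
            if now - sent_at > self.timeout:
                del self.in_flight[task_id]
                self.pending.append(task_id)
```

a task stuck in a dead worker's queue simply never gets acknowledged, so
`reap` puts it back on the pending list and it goes to a live worker instead.
note this gives at-least-once delivery: a slow (not dead) worker can cause a
task to run twice, which the original poster says is tolerable.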
this should be enough to do what you describe.
admittedly, the scheduler is a single point of failure, but that is the
price of simplicity. (and frequent checkpointing mitigates the effects of a
scheduler failure.)
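checkpointing the scheduler can be as simple as periodically serializing the
pending and in-flight task ids to disk and reloading them on restart, so a
crash costs at most one checkpoint interval of repeated work. this is a
hedged sketch; the file name, format, and function names are assumptions.

```python
# Periodically write scheduler state to disk; write-then-rename keeps the
# checkpoint file consistent even if the scheduler dies mid-write.
import json
import os

def checkpoint(state, path="scheduler.ckpt"):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)        # atomic rename on POSIX

def restore(path="scheduler.ckpt"):
    if not os.path.exists(path):
        return {"pending": [], "in_flight": []}
    with open(path) as f:
        return json.load(f)
```

on restart, treat every restored in-flight task as timed out and requeue it;
at worst some tasks run twice, which the at-least-once design already allows.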
andrew
On Aug 23, 2012, at 2:01 PM, Joe Planisky wrote:
> I'm a little stumped about how to handle network failures in a system that
> uses PUSH/PULL sockets as in the Parallel Pipeline case in chapter 2 of The
> Guide.
>
> As in the Guide, suppose my ventilator is pushing tasks to 3 workers. It
> doesn't matter which task gets pushed to which worker, but it's very
> important that all tasks eventually get sent to a worker.
>
> Everything is working fine; tasks are being load balanced to workers, workers
> are doing their thing and sending the results on to a sink. Now suppose
> there's a network failure between the ventilator and one of the workers.
> Suppose the ethernet cable to one of the worker machines is unplugged.
>
> Based on what we've seen in practice, the ventilator socket will still
> attempt to push some number of tasks to the now disconnected worker before
> realizing there's a problem. Tasks intended for that worker start backing
> up, presumably in ZMQ buffers and/or in buffers in the underlying OS (Ubuntu
> 10.04 in our case). Eventually, the PUSH socket figures out that something
> is wrong and stops trying to send additional tasks to that worker. All new
> tasks are then load balanced to the remaining workers.
>
> However, the tasks that are queued up for the disconnected worker are stuck
> and are never sent anywhere unless or until the original worker comes back
> online. If the original worker never comes back, those tasks never get
> executed. (If it does come back, it gets a burst of all the backed up tasks
> and the PUSH socket resumes load balancing new tasks to all 3 workers.)
>
> We'd like to prevent this backup from happening or at least minimize the
> number of tasks that get stuck. We've tried setting high water marks, send
> and receive timeouts, and send and receive buffer sizes in ZMQ to small
> values (e.g. 1) hoping that it would cause the PUSH socket to notice the
> problem sooner, but at best we still get several dozen task messages backed
> up before the socket notices the problem and stops trying. (Our task
> messages are small, about 520 bytes each.)
>
> If we have to, we can deal with the same task getting sent to more than one
> worker on an occasional basis, but we'd like to avoid that if possible.
>
> We're using ZMQ 2.2.0, but are also investigating the 3.2.0 release
> candidate. If it matters, we're accessing ZMQ with Java using the jzmq
> bindings. The underlying OS is Ubuntu 10.04.
>
> Any suggestions for how to deal with this?
>
> --
> Joe
>
>
> _______________________________________________
> zeromq-dev mailing list
> [email protected]
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
------------------
Andrew Hume (best -> Telework) +1 623-551-2845
[email protected] (Work) +1 973-236-2014
AT&T Labs - Research; member of USENIX and LOPSA