Thanks for the ideas, Andrew. 

--
Joe

On Aug 23, 2012, at 2:16 PM, Andrew Hume wrote:

> repeat after me:
> 
>       a load-balancing fair-share message routing system is NOT a job 
> scheduler.
> 
> it can nearly do a related thing, but it's not.
> there are several ways to do this; i'm sure the guide covers a couple.
> i would normally handle this one of two ways:
> 
> a) if tasks are expensive, then don't push tasks around. have workers ask
>       for a task (e.g. using REQ/REP) one at a time.
> b) if tasks are inexpensive (and efficiency matters), then shovel requests
>       via PUSH at the workers (who PULL) using a modest HWM. if one worker
>       gets 1000 tasks and another gets 10 (because there were only 1010),
>       who cares? the tasks are cheap.
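
[Editorial note: a minimal sketch of option (a), assuming pyzmq is available. A worker asks the scheduler for one task at a time over REQ/REP; the inproc transport, the thread, and all names here are illustrative, chosen only to keep the example self-contained.]

```python
import threading
import zmq

ctx = zmq.Context.instance()
rep = ctx.socket(zmq.REP)
rep.bind("inproc://tasks")  # bind before the worker connects

def scheduler(tasks):
    # hand out one task per request, then an empty reply to stop the worker
    for task in tasks:
        rep.recv()             # a worker asks for work
        rep.send_string(task)  # give it exactly one task
    rep.recv()
    rep.send_string("")        # empty reply signals "no more work"
    rep.close()

done = []

def worker():
    req = ctx.socket(zmq.REQ)
    req.connect("inproc://tasks")
    while True:
        req.send(b"ready")          # explicitly ask for the next task
        task = req.recv_string()
        if not task:                # empty reply: shut down
            break
        done.append(task)
    req.close()

t = threading.Thread(target=scheduler, args=(["t1", "t2", "t3"],))
t.start()
worker()
t.join()
ctx.term()
```

Because a worker only ever holds the one task it asked for, unplugging that worker strands at most one task, which the timeout/reschedule bookkeeping below recovers.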
> 
> in each case, when a task gets sent, it gets a timestamp and a timeout.
> workers PUSH back an acknowledgement when they complete a task; the
> scheduler process marks it as done. when a task times out, you schedule
> it again. this should be enough to do what you describe.
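
[Editorial note: a sketch of this bookkeeping, independent of the transport. Each task handed out is stamped; acknowledged tasks are marked done; tasks whose timeout expires go back on the queue. The class and method names are illustrative, not from any library.]

```python
import time
from collections import deque

class Scheduler:
    def __init__(self, timeout):
        self.timeout = timeout
        self.queue = deque()   # tasks waiting to be handed out
        self.in_flight = {}    # task id -> timestamp when it was sent
        self.done = set()

    def submit(self, task_id):
        self.queue.append(task_id)

    def next_task(self, now=None):
        """Hand out the next task and stamp it."""
        now = time.time() if now is None else now
        task_id = self.queue.popleft()
        self.in_flight[task_id] = now
        return task_id

    def ack(self, task_id):
        """Worker PUSHed back an acknowledgement: mark the task done."""
        self.in_flight.pop(task_id, None)
        self.done.add(task_id)

    def reschedule_expired(self, now=None):
        """Re-queue every in-flight task whose timeout has passed."""
        now = time.time() if now is None else now
        expired = [t for t, sent in self.in_flight.items()
                   if now - sent > self.timeout]
        for t in expired:
            del self.in_flight[t]
            self.queue.append(t)
        return expired

s = Scheduler(timeout=5.0)
s.submit("a"); s.submit("b")
s.next_task(now=0.0)            # "a" goes out at t=0
s.next_task(now=1.0)            # "b" goes out at t=1
s.ack("b")                      # "b" completes
s.reschedule_expired(now=10.0)  # "a" timed out, back on the queue
```

Note that a rescheduled task may still complete on the original (slow or reconnected) worker, which is why the original poster's tolerance for occasional duplicate execution matters.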
> 
> admittedly, there is a single point of failure in the scheduler, but it is
> paid for by simplicity. (and by frequent checkpointing, you can mitigate
> the effects of scheduler failure.)
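
[Editorial note: a hedged sketch of what such checkpointing could look like: periodically dump the pending and in-flight task ids to disk so a restarted scheduler can resume, re-queuing anything that was in flight. The file name and functions are illustrative.]

```python
import json
import os
import tempfile

def checkpoint(path, pending, in_flight):
    # write atomically: temp file + rename, so a crash mid-write
    # never leaves a truncated checkpoint behind
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"pending": pending, "in_flight": in_flight}, f)
    os.replace(tmp, path)

def recover(path):
    with open(path) as f:
        state = json.load(f)
    # anything that was in flight at crash time is simply rescheduled
    return state["pending"] + state["in_flight"]

checkpoint("sched.json", ["c"], ["a", "b"])
print(recover("sched.json"))  # ['c', 'a', 'b']
```

The trade-off is the checkpoint interval: a longer interval means fewer disk writes but more tasks re-executed (as duplicates) after a scheduler restart.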
> 
>       andrew
> 
> On Aug 23, 2012, at 2:01 PM, Joe Planisky wrote:
> 
>> I'm a little stumped about how to handle network failures in a system that 
>> uses PUSH/PULL sockets as in the Parallel Pipeline case in chapter 2 of The 
>> Guide.
>> 
>> As in the Guide, suppose my ventilator is pushing tasks to 3 workers.  It 
>> doesn't matter which task gets pushed to which worker, but it's very 
>> important that all tasks eventually get sent to a worker.  
>> 
>> Everything is working fine; tasks are being load balanced to workers, 
>> workers are doing their thing and sending the results on to a sink.  Now 
>> suppose there's a network failure between the ventilator and one of the
>> workers: say, the Ethernet cable to one of the worker machines is
>> unplugged.
>> 
>> Based on what we've seen in practice, the ventilator socket will still 
>> attempt to push some number of tasks to the now disconnected worker before 
>> realizing there's a problem.  Tasks intended for that worker start backing 
>> up, presumably in ZMQ buffers and/or in buffers in the underlying OS (Ubuntu 
>> 10.04 in our case).  Eventually, the PUSH socket figures out that something 
>> is wrong and stops trying to send additional tasks to that worker. All new 
>> tasks are then load balanced to the remaining workers.  
>> 
>> However, the tasks that are queued up for the disconnected worker are stuck 
>> and are never sent anywhere unless or until the original worker comes back 
>> online.  If the original worker never comes back, those tasks never get 
>> executed. (If it does come back, it gets a burst of all the backed up tasks 
>> and the PUSH socket resumes load balancing new tasks to all 3 workers.)
>> 
>> We'd like to prevent this backup from happening or at least minimize the 
>> number of tasks that get stuck.  We've tried setting high water marks, send 
>> and receive timeouts, and send and receive buffer sizes in ZMQ to small 
>> values (e.g. 1) hoping that it would cause the PUSH socket to notice the 
>> problem sooner, but at best we still get several dozen task messages backed 
>> up before the socket notices the problem and stops trying.  (Our task 
>> messages are small, about 520 bytes each.)
>> 
>> If we have to, we can deal with the same task getting sent to more than one 
>> worker on an occasional basis, but we'd like to avoid that if possible.
>> 
>> We're using ZMQ 2.2.0, but are also investigating the 3.2.0 release 
>> candidate.  If it matters, we're accessing ZMQ with Java using the jzmq 
>> bindings.  The underlying OS is Ubuntu 10.04.
>> 
>> Any suggestions for how to deal with this?
>> 
>> --
>> Joe
>> 
>> 
>> _______________________________________________
>> zeromq-dev mailing list
>> [email protected]
>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
> 
> 
> ------------------
> Andrew Hume  (best -> Telework) +1 623-551-2845
> [email protected]  (Work) +1 973-236-2014
> AT&T Labs - Research; member of USENIX and LOPSA
> 
> 
> 
> 
