Yeah makes total sense. https://issues.apache.org/jira/browse/NIFI-338
Thanks!
Joe

On Mon, Feb 9, 2015 at 6:43 PM, Andrew Purtell <[email protected]> wrote:
>
> *Automated Cluster Load Balancing*
>
> I put your excellent response up as NIFI-337.
>
> * Multiple Masters *
>
>> that single master is solely for command/control of changes to the
>> dataflow configuration and is a very lightweight process that does
>> nothing more than that.  If the master dies then all nodes continue
>> to do what they were doing and even site-to-site continues to
>> distribute data.  It just does so without updates on current loading
>> across the cluster.
>
> Yes, but imagine a NiFi installation, perhaps a hosted service built
> on top of it, where DataFlow Managers expect the command and control
> aspect of the system to be as robust and available as flow processing
> itself. If one or more standby masters are waiting in the wings to
> take over service for the failed active master, then automated and
> unattended failover would be possible and likely to narrow the
> interval during which administrative changes may fail.
>
>
> On Sun, Feb 8, 2015 at 2:24 PM, Joe Witt <[email protected]> wrote:
>
>> Andrew
>>
>>
>> *Automated Cluster Load Balancing*
>>
>> The processors themselves are available and ready to run on all
>> nodes at all times. It's really just a question of whether they have
>> data to run on. We have always taken the view that if you want
>> scalable dataflow, use scalable interfaces. And I think that is the
>> way to go in every case where you can pull it off. That generally
>> meant one should use data sources which offer queueing semantics,
>> where multiple independent nodes can pull from the queue with
>> at-least-once guarantees. In addition, each node has back pressure,
>> so if it falls behind it slows its rate of pickup, which means other
>> nodes in the cluster can pick up the slack. This has worked
>> extremely well.
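[Editor's note: the pull-from-a-queue-with-back-pressure pattern Joe describes can be sketched as below. This is a minimal illustration, not NiFi code; the class and variable names are invented, and a single-threaded round-robin loop stands in for independent cluster nodes.]

```python
import queue

# Hypothetical sketch: independent nodes pull from a shared queue and
# apply local back pressure by declining pickup when their backlog grows,
# so faster nodes pick up the slack.

class Node:
    def __init__(self, name, backlog_limit):
        self.name = name
        self.backlog = []              # work accepted but not yet processed
        self.backlog_limit = backlog_limit

    def try_pickup(self, shared_queue):
        # Back pressure: a node that has fallen behind stops pulling,
        # leaving items on the queue for other nodes to claim.
        if len(self.backlog) >= self.backlog_limit:
            return False
        try:
            item = shared_queue.get_nowait()
        except queue.Empty:
            return False
        self.backlog.append(item)
        return True

shared = queue.Queue()
for i in range(10):
    shared.put(i)

fast = Node("fast", backlog_limit=8)
slow = Node("slow", backlog_limit=2)   # small capacity: backs off early

# Alternate pickup attempts: once the slow node hits its limit it
# declines, and the fast node drains the rest of the queue.
while not shared.empty():
    for node in (slow, fast):
        node.try_pickup(shared)

print(len(slow.backlog), len(fast.backlog))  # prints: 2 8
```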
>>
>> That said, I recognize that it simply isn't always possible to use
>> scalable interfaces, and given enough non-scalable data sources the
>> cluster could become unbalanced. So this certainly seems like a good
>> [valuable, fun, non-trivial] problem to tackle. If we allow
>> connections between processors to be auto-balanced, it will make for
>> a pretty smooth experience, as users won't really have to think too
>> much about it.
>>
>> So the key will be how to devise and implement an algorithm or
>> approach for spreading that load intelligently, so that data doesn't
>> just bounce back and forth. If anyone knows of good papers, similar
>> systems, or approaches they can describe for how to think through
>> this, that would be great. Things we'll have to think about here
>> that come to mind:
>>
>> - When to start spreading the load (at what factor should we start
>>   spreading work across the cluster)
>>
>> - Whether it should auto-spread by default and the user can tell it
>>   not to in certain cases, or whether it should not spread by
>>   default and the user can activate it
>>
>> - What the criteria are by which we should let a user control how
>>   data is partitioned (some key, round robin, etc.), and how to
>>   rebalance/re-assign partitions if a node dies or comes on-line
>>
>> There are 'counter cases' too that we must keep in mind, such as
>> aggregation or bin packing grouped by some key. In those cases all
>> the data would need to be merged together at some point, and thus
>> all the data needs to be accessible at some point. Whether that
>> means we direct all data to a single node or whether we enable
>> cross-cluster data addressing is also a topic there.
>>
>>
>> * Multiple Masters *
>>
>> So it is true that we have a single-master model today. But that
>> single master is solely for command/control of changes to the
>> dataflow configuration and is a very lightweight process that does
>> nothing more than that.
>> If the master dies, then all nodes continue to do what they were
>> doing, and even site-to-site continues to distribute data. It just
>> does so without updates on current loading across the cluster. Once
>> the master is brought back on-line, the real-time command and
>> control functions return. Building support for a back-up master to
>> offer HA of even the command/control side would probably also be a
>> considerable effort. On this one I'd be curious to hear of cases
>> where it was critical to make this part HA.
>>
>> Excited to be talking about this level of technical detail.
>>
>> Thanks
>> Joe
>>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
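[Editor's note: returning to the partitioning criteria Joe lists under *Automated Cluster Load Balancing*, one common approach to key-based partitioning with cheap rebalancing is a consistent-hash ring. The sketch below is hypothetical and none of its names are NiFi APIs; it shows that all data for a given key lands on one node, which also serves the aggregation/bin-packing counter case, and that when a node dies only that node's keys get re-assigned.]

```python
import bisect
import hashlib

# Illustrative consistent-hash ring: each node is hashed onto the ring at
# several virtual points, and a key is owned by the first node point at or
# after the key's hash (wrapping around). Removing a node moves only the
# keys it owned; everything else stays put.

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []                # sorted list of (hash, node)
        for n in nodes:
            self.add(n)

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        for v in range(self.vnodes):
            self._ring.append((self._hash(f"{node}#{v}"), node))
        self._ring.sort()

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def owner(self, key):
        h = self._hash(key)
        i = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"customer-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.remove("node-b")                  # simulate a node dying
after = {k: ring.owner(k) for k in keys}

# Only keys that lived on node-b move; keys on node-a and node-c stay.
moved = sum(1 for k in keys if before[k] != after[k])
print(moved == sum(1 for k in keys if before[k] == "node-b"))  # prints: True
```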

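[Editor's note: on the multiple-masters question, the standby takeover Andrew proposes can be illustrated with a heartbeat deadline. This is a deliberately naive, hypothetical sketch, not NiFi code: a production design would use a coordination service such as ZooKeeper for leader election to avoid split-brain, which this sketch ignores.]

```python
import time

# A standby watches the active master's heartbeats and promotes itself
# when a deadline is missed, narrowing the window in which command and
# control of the dataflow is unavailable.

class Master:
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.monotonic()

    def heartbeat(self, now=None):
        self.last_heartbeat = time.monotonic() if now is None else now

class Standby:
    def __init__(self, name, timeout):
        self.name = name
        self.timeout = timeout         # seconds of silence before failover
        self.is_active = False

    def check(self, master, now=None):
        now = time.monotonic() if now is None else now
        if now - master.last_heartbeat > self.timeout:
            self.is_active = True      # promote: take over command/control
        return self.is_active

# Simulated clock: master's last beat was at t=0, standby tolerates 3s.
master = Master("primary")
master.heartbeat(now=0.0)
standby = Standby("backup", timeout=3.0)

print(standby.check(master, now=1.0))  # prints: False (within the window)
print(standby.check(master, now=5.0))  # prints: True (deadline missed)
```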