Yeah makes total sense. https://issues.apache.org/jira/browse/NIFI-338
Thanks!
Joe

On Mon, Feb 9, 2015 at 6:43 PM, Andrew Purtell <[email protected]> wrote:
>
> *Automated Cluster Load Balancing*
>
> I put your excellent response up as NIFI-337.
>
> * Multiple Masters *
>
>> that single master is solely for command/control of changes to the
>> dataflow configuration and is a very lightweight process that does
>> nothing more than that.  If the master dies then all nodes continue
>> to do what they were doing and even site-to-site continues to
>> distribute data.  It just does so without updates on current loading
>> across the cluster.
>
> Yes, but imagine a NiFi installation, perhaps a hosted service built
> on top of it, where DataFlow Managers expect the command and control
> aspect of the system to be as robust and available as flow processing
> itself. If one or more standby masters are waiting in the wings to
> take over service for the failed active master, then automated and
> unattended failover would be possible and likely to narrow the
> interval during which administrative changes may fail.
>
>
> On Sun, Feb 8, 2015 at 2:24 PM, Joe Witt <[email protected]> wrote:
>
>> Andrew
>>
>>
>> *Automated Cluster Load Balancing*
>>
>> The processors themselves are available and ready to run on all
>> nodes at all times. It's really just a question of whether they have
>> data to run on. We have always taken the view that if you want
>> scalable dataflow, use scalable interfaces. And I think that is the
>> way to go in every case where you can pull it off. That generally
>> meant one should use data sources which offer queueing semantics,
>> where multiple independent nodes can pull from the queue with
>> at-least-once guarantees. In addition, each node has back pressure,
>> so if it falls behind it slows its rate of pickup, which means other
>> nodes in the cluster can pick up the slack. This has worked
>> extremely well.
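[Editor's note: the pull-from-a-queue-with-back-pressure pattern Joe describes can be sketched as below. This is a minimal illustration, not NiFi code; the class and variable names are invented, and a single-threaded round-robin loop stands in for independent cluster nodes.]

```python
import queue

# Hypothetical sketch: independent nodes pull from a shared queue and
# apply local back pressure by declining pickup when their backlog grows,
# so faster nodes pick up the slack.

class Node:
    def __init__(self, name, backlog_limit):
        self.name = name
        self.backlog = []              # work accepted but not yet processed
        self.backlog_limit = backlog_limit

    def try_pickup(self, shared_queue):
        # Back pressure: a node that has fallen behind stops pulling,
        # leaving items on the queue for other nodes to claim.
        if len(self.backlog) >= self.backlog_limit:
            return False
        try:
            item = shared_queue.get_nowait()
        except queue.Empty:
            return False
        self.backlog.append(item)
        return True

shared = queue.Queue()
for i in range(10):
    shared.put(i)

fast = Node("fast", backlog_limit=8)
slow = Node("slow", backlog_limit=2)   # small capacity: backs off early

# Alternate pickup attempts: once the slow node hits its limit it
# declines, and the fast node drains the rest of the queue.
while not shared.empty():
    for node in (slow, fast):
        node.try_pickup(shared)

print(len(slow.backlog), len(fast.backlog))  # prints: 2 8
```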
>>
>> That said, I recognize that it simply isn't always possible to use
>> scalable interfaces, and given enough non-scalable data sources the
>> cluster could become unbalanced. So this certainly seems like a good
>> [valuable, fun, non-trivial] problem to tackle. If we allow
>> connections between processors to be auto-balanced, it will make for
>> a pretty smooth experience, as users won't really have to think too
>> much about it.
>>
>> So the key will be how to devise and implement an algorithm or
>> approach for spreading that load intelligently, so that data doesn't
>> just bounce back and forth. If anyone knows of good papers, similar
>> systems, or approaches they can describe for how to think through
>> this, that would be great. Things we'll have to think about here
>> that come to mind:
>>
>> - When to start spreading the load (at what factor should we start
>>   spreading work across the cluster)
>>
>> - Whether it should auto-spread by default and the user can tell it
>>   not to in certain cases, or whether it should not spread by
>>   default and the user can activate it
>>
>> - What the criteria are by which we should let a user control how
>>   data is partitioned (some key, round robin, etc.), and how to
>>   rebalance/re-assign partitions if a node dies or comes on-line
>>
>> There are 'counter cases' too that we must keep in mind, such as
>> aggregation or bin packing grouped by some key. In those cases all
>> the data would need to be merged together at some point, and thus
>> all the data needs to be accessible at some point. Whether that
>> means we direct all data to a single node or whether we enable
>> cross-cluster data addressing is also a topic there.
>>
>>
>> * Multiple Masters *
>>
>> So it is true that we have a single-master model today. But that
>> single master is solely for command/control of changes to the
>> dataflow configuration and is a very lightweight process that does
>> nothing more than that.
>> If the master dies, then all nodes continue to do what they were
>> doing, and even site-to-site continues to distribute data. It just
>> does so without updates on current loading across the cluster. Once
>> the master is brought back on-line, the real-time command and
>> control functions return. Building support for a back-up master to
>> offer HA of even the command/control side would probably also be a
>> considerable effort. On this one I'd be curious to hear of cases
>> where it was critical to make this part HA.
>>
>> Excited to be talking about this level of technical detail.
>>
>> Thanks
>> Joe
>>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
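[Editor's note: returning to the partitioning criteria Joe lists under *Automated Cluster Load Balancing*, one common approach to key-based partitioning with cheap rebalancing is a consistent-hash ring. The sketch below is hypothetical and none of its names are NiFi APIs; it shows that all data for a given key lands on one node, which also serves the aggregation/bin-packing counter case, and that when a node dies only that node's keys get re-assigned.]

```python
import bisect
import hashlib

# Illustrative consistent-hash ring: each node is hashed onto the ring at
# several virtual points, and a key is owned by the first node point at or
# after the key's hash (wrapping around). Removing a node moves only the
# keys it owned; everything else stays put.

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []                # sorted list of (hash, node)
        for n in nodes:
            self.add(n)

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        for v in range(self.vnodes):
            self._ring.append((self._hash(f"{node}#{v}"), node))
        self._ring.sort()

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def owner(self, key):
        h = self._hash(key)
        i = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"customer-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.remove("node-b")                  # simulate a node dying
after = {k: ring.owner(k) for k in keys}

# Only keys that lived on node-b move; keys on node-a and node-c stay.
moved = sum(1 for k in keys if before[k] != after[k])
print(moved == sum(1 for k in keys if before[k] == "node-b"))  # prints: True
```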

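[Editor's note: on the multiple-masters question, the standby takeover Andrew proposes can be illustrated with a heartbeat deadline. This is a deliberately naive, hypothetical sketch, not NiFi code: a production design would use a coordination service such as ZooKeeper for leader election to avoid split-brain, which this sketch ignores.]

```python
import time

# A standby watches the active master's heartbeats and promotes itself
# when a deadline is missed, narrowing the window in which command and
# control of the dataflow is unavailable.

class Master:
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.monotonic()

    def heartbeat(self, now=None):
        self.last_heartbeat = time.monotonic() if now is None else now

class Standby:
    def __init__(self, name, timeout):
        self.name = name
        self.timeout = timeout         # seconds of silence before failover
        self.is_active = False

    def check(self, master, now=None):
        now = time.monotonic() if now is None else now
        if now - master.last_heartbeat > self.timeout:
            self.is_active = True      # promote: take over command/control
        return self.is_active

# Simulated clock: master's last beat was at t=0, standby tolerates 3s.
master = Master("primary")
master.heartbeat(now=0.0)
standby = Standby("backup", timeout=3.0)

print(standby.check(master, now=1.0))  # prints: False (within the window)
print(standby.check(master, now=5.0))  # prints: True (deadline missed)
```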