Andrew,

To be clear, you are advocating we knock out an automated load distribution across nodes in the cluster as data passes through some point on the flow? This would be so that someone wouldn't even need to add a remote process group for it to happen.
Cool if so, just want to be sure that is also what you were advocating.

Thanks
Joe

On Feb 8, 2015 1:48 PM, "Andrew Purtell" <[email protected]> wrote:

> > But do you think this is something we should tackle soon (based on
> > real scenarios you face)?
>
> +1
>
> I was looking at the user guide as a first timer. Because I somehow had
> an assumption that a NiFi cluster would transparently scale out
> (probably due to Storm and Spark Streaming), in the sense that I would
> define logical flows but there would then be a 'hidden' scale out
> physical plan that emerges from it, having to explicitly do this with
> remote process groups was a surprise. The resulting disjoint flows
> complicate what would otherwise be a clean and easy to follow flow
> design? When I used to do flow based programming with Cascading I'd
> define logical assemblies and its topographical planner
> (http://docs.cascading.org/cascading/2.6/userguide/html/ch14s02.html)
> would automatically identify the parallelism possible in the design and
> distribute it.
>
> On Friday, February 6, 2015, Joe Witt <[email protected]> wrote:
>
> > Ricky,
> >
> > So the use case you're coming from here is a good and common one,
> > which is:
> >
> > If I have a datasource which does not offer scalability (it can only
> > send to a single node, for instance) but I have a scalable
> > distribution cluster, what are my options?
> >
> > So today you can accept the data on a single node, then immediately
> > do as Mark describes and fire it to a "Remote Process Group"
> > addressing the cluster itself. That way NiFi will automatically
> > figure out all the nodes in the cluster and spread the data around,
> > factoring in load, etc. But we do want to establish an even more
> > automatic mechanism on a connection itself where the user can
> > indicate the data should be auto-balanced.
> >
> > The reverse is really true as well, where you can have a consumer
> > which only wants to accept from a single host. So there too we need
> > a mechanism to descale the approach.
> >
> > I realize the flow you're working with now is just a sort of
> > familiarization thing. But do you think this is something we should
> > tackle soon (based on real scenarios you face)?
> >
> > Thanks
> > Joe
> >
> > On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:
> > > Ricky,
> > >
> > > I don’t think there’s a JIRA ticket currently. Feel free to create
> > > one.
> > >
> > > I think we may need to do a better job documenting how the Remote
> > > Process Groups work. If you have a cluster set up, you would add a
> > > Remote Process Group that points to the Cluster Manager (i.e., the
> > > URL that you connect to in order to see the graph).
> > >
> > > Then, anything that you send to the Remote Process Group will
> > > automatically get load-balanced across all of the nodes in the
> > > cluster. So you could set up a flow that looks something like:
> > >
> > > GenerateFlowFile -> RemoteProcessGroup
> > >
> > > Input Port -> HashContent
> > >
> > > So these 2 flows are disjoint. The first part generates data and
> > > then distributes it to the cluster (when you connect to the Remote
> > > Process Group, you choose which Input Port to send to).
> > >
> > > But what we’d like to do in the future is something like:
> > >
> > > GenerateFlowFile -> HashContent
> > >
> > > And then on the connection in the middle choose to auto-distribute
> > > the data. Right now you have to put the Remote Process Group in
> > > there to distribute to the cluster, and add the Input Port to
> > > receive the data. But there should only be a single
> > > RemoteProcessGroup that points to the entire cluster, not one per
> > > node.
> > >
> > > Thanks
> > >
> > > -Mark
> > >
> > > From: Ricky Saltzer
> > > Sent: Friday, February 6, 2015 3:06 PM
> > > To: [email protected]
> > >
> > > Mark -
> > >
> > > Thanks for the fast reply, much appreciated. This is what I
> > > figured, but since I was already in clustered mode, I wanted to
> > > make sure there wasn't an easier way than adding each node as a
> > > remote process group.
> > >
> > > Is there already a JIRA to track the ability to auto distribute in
> > > clustered mode, or would you like me to open it up?
> > >
> > > Thanks again,
> > > Ricky
> > >
> > > On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:
> > >
> > >> Ricky,
> > >>
> > >> The DistributeLoad processor is simply used to route to one of many
> > >> relationships. So if you have, for instance, 5 different servers
> > >> that you can FTP files to, you can use DistributeLoad to round
> > >> robin the files between them, so that you end up pushing 20% to
> > >> each of 5 PutFTP processors.
> > >>
> > >> What you’re wanting to do, it sounds like, is to distribute the
> > >> FlowFiles to different nodes in the cluster. The Remote Process
> > >> Group is how you would need to do that at this time. We have
> > >> discussed having the ability to mark a Connection as
> > >> “Auto-Distributed” (or maybe some better name 😊) and have that
> > >> automatically distribute the data between nodes in the cluster,
> > >> but that feature hasn’t yet been implemented.
> > >>
> > >> Does that answer your question?
> > >>
> > >> Thanks
> > >>
> > >> -Mark
> > >>
> > >> From: Ricky Saltzer
> > >> Sent: Friday, February 6, 2015 2:56 PM
> > >> To: [email protected]
> > >>
> > >> Hi -
> > >>
> > >> I have a question regarding load distribution in a clustered NiFi
> > >> environment. I have a really simple example: I'm using the
> > >> GenerateFlowFile processor to generate some random data, then I
> > >> MD5 hash the file and print out the resulting hash.
> > >>
> > >> I want only the primary node to generate the data, but I want both
> > >> nodes in the cluster to share the hashing workload. It appears if I
> > >> set the scheduling strategy to "On primary node" for the
> > >> GenerateFlowFile processor, then the next processor (HashContent)
> > >> is only being accepted and processed by a single node.
> > >>
> > >> I've put a DistributeLoad processor in between the HashContent and
> > >> GenerateFlowFile, but this requires me to use the remote process
> > >> group to distribute the load, which doesn't seem intuitive when
> > >> I'm already clustered.
> > >>
> > >> I guess my question is, is it possible for the DistributeLoad
> > >> processor to understand that NiFi is in a clustered environment,
> > >> and have an ability to distribute the next processor (HashContent)
> > >> amongst all nodes in the cluster?
> > >>
> > >> Cheers,
> > >> --
> > >> Ricky Saltzer
> > >> http://www.cloudera.com
> > >>
> > >
> > > --
> > > Ricky Saltzer
> > > http://www.cloudera.com
> > >
> >
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
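
For anyone reading the archive later, here is a minimal sketch of the round-robin routing Mark describes for DistributeLoad (five destinations, roughly 20% of the files to each). It is plain Python and only illustrates the idea; it is not NiFi code and says nothing about how NiFi implements the processor. The names `round_robin`, `PutFTP-1` through `PutFTP-5`, and `flowfile-N` are hypothetical stand-ins.

```python
from collections import Counter
from itertools import cycle


def round_robin(items, destinations):
    """Pair each item with the next destination in a repeating cycle."""
    return list(zip(items, cycle(destinations)))


# Hypothetical names: five destinations standing in for five PutFTP processors.
destinations = [f"PutFTP-{n}" for n in range(1, 6)]
flowfiles = [f"flowfile-{n}" for n in range(100)]

assignments = round_robin(flowfiles, destinations)
counts = Counter(dest for _, dest in assignments)

# Report how many files each destination received and its share of the total.
for dest in destinations:
    share = counts[dest] / len(flowfiles)
    print(f"{dest}: {counts[dest]} files ({share:.0%})")
```

Note that this only picks one of several outgoing routes on a single node. Spreading data across the nodes of a cluster is the separate problem the thread is about, which at the time of these messages needs a Remote Process Group.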
