> But do you think this is something we should tackle soon (based on real scenarios you face)?
+1. I was looking at the user guide as a first-timer. I had somehow assumed that a NiFi cluster would transparently scale out (probably due to Storm and Spark Streaming), in the sense that I would define logical flows and a "hidden" scaled-out physical plan would emerge from them, so having to do this explicitly with remote process groups was a surprise. The resulting disjoint flows complicate what would otherwise be a clean and easy-to-follow flow design. When I used to do flow-based programming with Cascading, I would define logical assemblies and its planner (http://docs.cascading.org/cascading/2.6/userguide/html/ch14s02.html) would automatically identify the parallelism possible in the design and distribute it.

On Friday, February 6, 2015, Joe Witt <[email protected]> wrote:

> Ricky,
>
> So the use case you're coming from here is a good and common one, which is:
>
> If I have a data source which does not offer scalability (it can only
> send to a single node, for instance) but I have a scalable distribution
> cluster, what are my options?
>
> So today you can accept the data on a single node, then immediately do as
> Mark describes and fire it to a "Remote Process Group" addressing the
> cluster itself. That way NiFi will automatically figure out all the
> nodes in the cluster and spread the data around, factoring in load, etc.
> But we do want to establish an even more automatic mechanism on a
> connection itself, where the user can indicate the data should be
> auto-balanced.
>
> The reverse is really true as well, where you can have a consumer which
> only wants to accept from a single host. So there, too, we need a
> mechanism to descale the approach.
>
> I realize the flow you're working with now is just a sort of
> familiarization thing. But do you think this is something we should
> tackle soon (based on real scenarios you face)?
>
> Thanks
> Joe
>
> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:
> > Ricky,
> >
> > I don't think there's a JIRA ticket currently. Feel free to create one.
> >
> > I think we may need to do a better job documenting how Remote Process
> > Groups work. If you have a cluster setup, you would add a Remote Process
> > Group that points to the Cluster Manager (i.e., the URL that you connect
> > to in order to see the graph).
> >
> > Then, anything that you send to the Remote Process Group will
> > automatically get load-balanced across all of the nodes in the cluster.
> > So you could set up a flow that looks something like:
> >
> > GenerateFlowFile -> RemoteProcessGroup
> >
> > Input Port -> HashContent
> >
> > So these two flows are disjoint. The first part generates data and then
> > distributes it to the cluster (when you connect to the Remote Process
> > Group, you choose which Input Port to send to).
> >
> > But what we'd like to do in the future is something like:
> >
> > GenerateFlowFile -> HashContent
> >
> > And then on the connection in the middle choose to auto-distribute the
> > data. Right now you have to put the Remote Process Group in there to
> > distribute to the cluster, and add the Input Port to receive the data.
> > But there should only be a single RemoteProcessGroup that points to the
> > entire cluster, not one per node.
> >
> > Thanks
> > -Mark
> >
> > From: Ricky Saltzer
> > Sent: Friday, February 6, 2015 3:06 PM
> > To: [email protected]
> >
> > Mark -
> >
> > Thanks for the fast reply, much appreciated. This is what I figured, but
> > since I was already in clustered mode, I wanted to make sure there wasn't
> > an easier way than adding each node as a remote process group.
> >
> > Is there already a JIRA to track the ability to auto-distribute in
> > clustered mode, or would you like me to open it up?
> >
> > Thanks again,
> > Ricky
> >
> > On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:
> >
> >> Ricky,
> >>
> >> The DistributeLoad processor is simply used to route to one of many
> >> relationships. So if you have, for instance, 5 different servers that
> >> you can FTP files to, you can use DistributeLoad to round-robin the
> >> files between them, so that you end up pushing 20% to each of 5 PutFTP
> >> processors.
> >>
> >> What you're wanting to do, it sounds like, is to distribute the
> >> FlowFiles to different nodes in the cluster. The Remote Process Group
> >> is how you would need to do that at this time. We have discussed having
> >> the ability to mark a Connection as "Auto-Distributed" (or maybe some
> >> better name 😊) and have that automatically distribute the data between
> >> nodes in the cluster, but that feature hasn't yet been implemented.
> >>
> >> Does that answer your question?
> >>
> >> Thanks
> >> -Mark
> >>
> >> From: Ricky Saltzer
> >> Sent: Friday, February 6, 2015 2:56 PM
> >> To: [email protected]
> >>
> >> Hi -
> >>
> >> I have a question regarding load distribution in a clustered NiFi
> >> environment. I have a really simple example: I'm using the
> >> GenerateFlowFile processor to generate some random data, then I MD5
> >> hash the file and print out the resulting hash.
> >>
> >> I want only the primary node to generate the data, but I want both
> >> nodes in the cluster to share the hashing workload. It appears that if
> >> I set the scheduling strategy to "On primary node" for the
> >> GenerateFlowFile processor, then the data is only accepted and
> >> processed by the next processor (HashContent) on a single node.
> >>
> >> I've put a DistributeLoad processor in between GenerateFlowFile and
> >> HashContent, but this requires me to use the remote process group to
> >> distribute the load, which doesn't seem intuitive when I'm already
> >> clustered.
> >>
> >> I guess my question is: is it possible for the DistributeLoad processor
> >> to understand that NiFi is in a clustered environment, and to
> >> distribute the work of the next processor (HashContent) amongst all
> >> nodes in the cluster?
> >>
> >> Cheers,
> >> --
> >> Ricky Saltzer
> >> http://www.cloudera.com
> >
> > --
> > Ricky Saltzer
> > http://www.cloudera.com

--
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
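P.S. For anyone skimming the thread: here is a rough sketch, in plain Python, of the behavior being discussed — a single ingest node generating flow files, a round-robin split across cluster nodes (what DistributeLoad does among relationships, and what an "Auto-Distributed" connection would do among nodes), and each node MD5-hashing its share as HashContent does. This is purely illustrative; the function and node names are made up and this is not NiFi code.

```python
import hashlib
from itertools import cycle

def distribute_round_robin(flowfiles, nodes):
    """Assign each flow file to a node in round-robin order.

    Illustrative stand-in for DistributeLoad's round-robin strategy,
    or a hypothetical auto-balanced connection across cluster nodes.
    """
    assignment = {node: [] for node in nodes}
    node_cycle = cycle(nodes)
    for ff in flowfiles:
        assignment[next(node_cycle)].append(ff)
    return assignment

def hash_content(content: bytes) -> str:
    """MD5-hash flow file content, as HashContent configured for MD5 would."""
    return hashlib.md5(content).hexdigest()

# A single "primary node" generates six flow files (GenerateFlowFile)...
flowfiles = [f"record-{i}".encode() for i in range(6)]

# ...which are spread evenly across a hypothetical 2-node cluster,
# and each node hashes only its own share of the data.
assignment = distribute_round_robin(flowfiles, ["node-1", "node-2"])
for node, ffs in assignment.items():
    for ff in ffs:
        print(node, hash_content(ff))
```

With six flow files and two nodes, each node ends up with three — the even spread Ricky is after, without wiring up one remote process group per node.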
