> But do you think this is something we should tackle soon (based on real scenarios you face)?
+1. I was looking at the user guide as a first-timer. I had somehow assumed that a NiFi cluster would transparently scale out (probably due to Storm and Spark Streaming), in the sense that I would define logical flows and a "hidden" scaled-out physical plan would emerge from them, so having to do this explicitly with remote process groups was a surprise. The resulting disjoint flows complicate what would otherwise be a clean and easy-to-follow flow design. When I used to do flow-based programming with Cascading, I would define logical assemblies and its planner (http://docs.cascading.org/cascading/2.6/userguide/html/ch14s02.html) would automatically identify the parallelism possible in the design and distribute it.

On Friday, February 6, 2015, Joe Witt <[email protected]> wrote:

> Ricky,
>
> So the use case you're coming from here is a good and common one, which is:
>
> If I have a data source which does not offer scalability (it can only
> send to a single node, for instance) but I have a scalable distribution
> cluster, what are my options?
>
> So today you can accept the data on a single node, then immediately do as
> Mark describes and fire it to a "Remote Process Group" addressing the
> cluster itself. That way NiFi will automatically figure out all the
> nodes in the cluster and spread the data around, factoring in load, etc.
> But we do want to establish an even more automatic mechanism on a
> connection itself, where the user can indicate the data should be
> auto-balanced.
>
> The reverse is really true as well, where you can have a consumer which
> only wants to accept from a single host. So there, too, we need a
> mechanism to descale the approach.
>
> I realize the flow you're working with now is just a sort of
> familiarization thing. But do you think this is something we should
> tackle soon (based on real scenarios you face)?
>
> Thanks
> Joe
>
> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:
> > Ricky,
> >
> > I don't think there's a JIRA ticket currently. Feel free to create one.
> >
> > I think we may need to do a better job documenting how Remote Process
> > Groups work. If you have a cluster setup, you would add a Remote Process
> > Group that points to the Cluster Manager (i.e., the URL that you connect
> > to in order to see the graph).
> >
> > Then, anything that you send to the Remote Process Group will
> > automatically get load-balanced across all of the nodes in the cluster.
> > So you could set up a flow that looks something like:
> >
> > GenerateFlowFile -> RemoteProcessGroup
> >
> > Input Port -> HashContent
> >
> > So these two flows are disjoint. The first part generates data and then
> > distributes it to the cluster (when you connect to the Remote Process
> > Group, you choose which Input Port to send to).
> >
> > But what we'd like to do in the future is something like:
> >
> > GenerateFlowFile -> HashContent
> >
> > And then on the connection in the middle choose to auto-distribute the
> > data. Right now you have to put the Remote Process Group in there to
> > distribute to the cluster, and add the Input Port to receive the data.
> > But there should only be a single RemoteProcessGroup that points to the
> > entire cluster, not one per node.
> >
> > Thanks
> > -Mark
> >
> > From: Ricky Saltzer
> > Sent: Friday, February 6, 2015 3:06 PM
> > To: [email protected]
> >
> > Mark -
> >
> > Thanks for the fast reply, much appreciated. This is what I figured, but
> > since I was already in clustered mode, I wanted to make sure there wasn't
> > an easier way than adding each node as a remote process group.
> >
> > Is there already a JIRA to track the ability to auto-distribute in
> > clustered mode, or would you like me to open it up?
> >
> > Thanks again,
> > Ricky
> >
> > On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:
> >
> >> Ricky,
> >>
> >> The DistributeLoad processor is simply used to route to one of many
> >> relationships. So if you have, for instance, 5 different servers that
> >> you can FTP files to, you can use DistributeLoad to round-robin the
> >> files between them, so that you end up pushing 20% to each of 5 PutFTP
> >> processors.
> >>
> >> What you're wanting to do, it sounds like, is to distribute the
> >> FlowFiles to different nodes in the cluster. The Remote Process Group
> >> is how you would need to do that at this time. We have discussed having
> >> the ability to mark a Connection as "Auto-Distributed" (or maybe some
> >> better name 😊) and have that automatically distribute the data between
> >> nodes in the cluster, but that feature hasn't yet been implemented.
> >>
> >> Does that answer your question?
> >>
> >> Thanks
> >> -Mark
> >>
> >> From: Ricky Saltzer
> >> Sent: Friday, February 6, 2015 2:56 PM
> >> To: [email protected]
> >>
> >> Hi -
> >>
> >> I have a question regarding load distribution in a clustered NiFi
> >> environment. I have a really simple example: I'm using the
> >> GenerateFlowFile processor to generate some random data, then I MD5
> >> hash the file and print out the resulting hash.
> >>
> >> I want only the primary node to generate the data, but I want both
> >> nodes in the cluster to share the hashing workload. It appears that if
> >> I set the scheduling strategy to "On primary node" for the
> >> GenerateFlowFile processor, then the data is only accepted and
> >> processed by the next processor (HashContent) on a single node.
> >>
> >> I've put a DistributeLoad processor in between GenerateFlowFile and
> >> HashContent, but this requires me to use the remote process group to
> >> distribute the load, which doesn't seem intuitive when I'm already
> >> clustered.
> >>
> >> I guess my question is: is it possible for the DistributeLoad processor
> >> to understand that NiFi is in a clustered environment, and to
> >> distribute the work of the next processor (HashContent) amongst all
> >> nodes in the cluster?
> >>
> >> Cheers,
> >> --
> >> Ricky Saltzer
> >> http://www.cloudera.com
> >
> > --
> > Ricky Saltzer
> > http://www.cloudera.com

--
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
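P.S. For anyone skimming the thread: here is a rough sketch, in plain Python, of the behavior being discussed — a single ingest node generating flow files, a round-robin split across cluster nodes (what DistributeLoad does among relationships, and what an "Auto-Distributed" connection would do among nodes), and each node MD5-hashing its share as HashContent does. This is purely illustrative; the function and node names are made up and this is not NiFi code.

```python
import hashlib
from itertools import cycle

def distribute_round_robin(flowfiles, nodes):
    """Assign each flow file to a node in round-robin order.

    Illustrative stand-in for DistributeLoad's round-robin strategy,
    or a hypothetical auto-balanced connection across cluster nodes.
    """
    assignment = {node: [] for node in nodes}
    node_cycle = cycle(nodes)
    for ff in flowfiles:
        assignment[next(node_cycle)].append(ff)
    return assignment

def hash_content(content: bytes) -> str:
    """MD5-hash flow file content, as HashContent configured for MD5 would."""
    return hashlib.md5(content).hexdigest()

# A single "primary node" generates six flow files (GenerateFlowFile)...
flowfiles = [f"record-{i}".encode() for i in range(6)]

# ...which are spread evenly across a hypothetical 2-node cluster,
# and each node hashes only its own share of the data.
assignment = distribute_round_robin(flowfiles, ["node-1", "node-2"])
for node, ffs in assignment.items():
    for ff in ffs:
        print(node, hash_content(ff))
```

With six flow files and two nodes, each node ends up with three — the even spread Ricky is after, without wiring up one remote process group per node.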
