Site-to-site is a powerhouse feature but has caused a good bit of confusion. Perhaps we should plan for its inclusion in the set of things that can be tuned/set at runtime.
It would be good to include with that information about bound interfaces, what messages will get sent, etc. Folks in proxy-type situations have a hard time reasoning about what is happening; there is little sense of "is this thing on". What do you all think?

On Feb 8, 2015 6:51 AM, "Mark Payne" <[email protected]> wrote:

> Ricky,
>
> In the nifi.properties file, there's a property named "nifi.remote.input.port". By default it's empty. Set that to whatever port you want to use for site-to-site. Additionally, you'll need to either set "nifi.remote.input.secure" to false or configure the keystore and truststore properties. Configure this for the nodes and the NCM. After that you should be good to go (after a restart)!
>
> If you run into any issues, let us know.
>
> Thanks
> -Mark
>
> Sent from my iPhone

On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[email protected]> wrote:

> Hey Joe -
>
> This makes sense, and I'm in the process of trying it out now. I'm running into a small issue where the remote process group is saying neither of the nodes is configured for site-to-site communication.
>
> Although not super intuitive, sending to the remote process group pointing to the cluster should be fine as long as it works (which I'm sure it does).
>
> Ricky

On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[email protected]> wrote:

> Ricky,
>
> So the use case you're coming from here is a good and common one, which is: if I have a data source which does not offer scalability (it can only send to a single node, for instance) but I have a scalable distribution cluster, what are my options?
>
> So today you can accept the data on a single node, then immediately do as Mark describes and fire it to a "Remote Process Group" addressing the cluster itself.
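Mark's nifi.properties steps can be sketched as a fragment like the one below. The port number is only an example, and the commented keystore/truststore property names are illustrative assumptions — check them against the comments in your own nifi.properties:

```properties
# Unsecured site-to-site (example port; set on every node and on the NCM, then restart)
nifi.remote.input.port=10000
nifi.remote.input.secure=false

# Or leave nifi.remote.input.secure=true and configure the keystore/truststore
# properties instead (names below are illustrative assumptions):
# nifi.security.keystore=/opt/nifi/conf/keystore.jks
# nifi.security.truststore=/opt/nifi/conf/truststore.jks
```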
> That way NiFi will automatically figure out all the nodes in the cluster and spread the data around, factoring in load, etc. But we do want to establish an even more automatic mechanism on a connection itself, where the user can indicate that the data should be auto-balanced.
>
> The reverse is really true as well, where you can have a consumer which only wants to accept from a single host. So there, too, we need a mechanism to descale the approach.
>
> I realize the flow you're working with now is just a sort of familiarization thing. But do you think this is something we should tackle soon (based on real scenarios you face)?
>
> Thanks
> Joe

On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:

> Ricky,
>
> I don't think there's a JIRA ticket currently. Feel free to create one.
>
> I think we may need to do a better job documenting how the Remote Process Groups work. If you have a cluster setup, you would add a Remote Process Group that points to the Cluster Manager (i.e., the URL that you connect to in order to see the graph).
>
> Then, anything that you send to the Remote Process Group will automatically get load-balanced across all of the nodes in the cluster. So you could set up a flow that looks something like:
>
> GenerateFlowFile -> RemoteProcessGroup
>
> Input Port -> HashContent
>
> So these 2 flows are disjoint. The first part generates data and then distributes it to the cluster (when you connect to the Remote Process Group, you choose which Input Port to send to).
>
> But what we'd like to do in the future is something like:
>
> GenerateFlowFile -> HashContent
>
> And then, on the connection in the middle, choose to auto-distribute the data.
> Right now you have to put the Remote Process Group in there to distribute to the cluster, and add the Input Port to receive the data. But there should be only a single RemoteProcessGroup that points to the entire cluster, not one per node.
>
> Thanks
> -Mark

From: Ricky Saltzer
Sent: Friday, February 6, 2015 3:06 PM
To: [email protected]

> Mark -
>
> Thanks for the fast reply, much appreciated. This is what I figured, but since I was already in clustered mode, I wanted to make sure there wasn't an easier way than adding each node as a remote process group.
>
> Is there already a JIRA to track the ability to auto-distribute in clustered mode, or would you like me to open it up?
>
> Thanks again,
> Ricky

On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:

> Ricky,
>
> The DistributeLoad processor is simply used to route to one of many relationships. So if you have, for instance, 5 different servers that you can FTP files to, you can use DistributeLoad to round-robin the files between them, so that you end up pushing 20% to each of 5 PutFTP processors.
>
> What you're wanting to do, it sounds like, is to distribute the FlowFiles to different nodes in the cluster. The Remote Process Group is how you would need to do that at this time. We have discussed having the ability to mark a Connection as "Auto-Distributed" (or maybe some better name 😊) and have that automatically distribute the data between nodes in the cluster, but that feature hasn't yet been implemented.
>
> Does that answer your question?
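Conceptually, the round-robin behavior Mark describes for DistributeLoad can be sketched in plain Python (this is not the NiFi API; the relationship names and function are made up for illustration):

```python
from itertools import cycle

# Five downstream relationships, as in the 5-PutFTP example (names illustrative)
relationships = ["putftp-1", "putftp-2", "putftp-3", "putftp-4", "putftp-5"]

def distribute(flowfiles, relationships):
    """Send each incoming FlowFile to the next relationship in round-robin order."""
    assignments = {name: [] for name in relationships}
    chooser = cycle(relationships)
    for flowfile in flowfiles:
        assignments[next(chooser)].append(flowfile)
    return assignments

# 100 FlowFiles split evenly: 20 files (20%) to each of the 5 relationships
result = distribute(range(100), relationships)
```

This illustrates why DistributeLoad balances load across *relationships* on one node but knows nothing about other nodes in the cluster.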
> Thanks
> -Mark

From: Ricky Saltzer
Sent: Friday, February 6, 2015 2:56 PM
To: [email protected]

> Hi -
>
> I have a question regarding load distribution in a clustered NiFi environment. I have a really simple example: I'm using the GenerateFlowFile processor to generate some random data, then I MD5-hash the file and print out the resulting hash.
>
> I want only the primary node to generate the data, but I want both nodes in the cluster to share the hashing workload. It appears that if I set the scheduling strategy to "On primary node" for the GenerateFlowFile processor, then the next processor (HashContent) only runs on a single node.
>
> I've put a DistributeLoad processor in between the HashContent and GenerateFlowFile, but this requires me to use the remote process group to distribute the load, which doesn't seem intuitive when I'm already clustered.
>
> I guess my question is: is it possible for the DistributeLoad processor to understand that NiFi is in a clustered environment, and have an ability to distribute the next processor (HashContent) amongst all nodes in the cluster?
>
> Cheers,
> --
> Ricky Saltzer
> http://www.cloudera.com
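For reference, the hashing step in Ricky's flow (HashContent configured for MD5) boils down to computing a digest of the FlowFile's content; outside NiFi that is just:

```python
import hashlib

def md5_of(content: bytes) -> str:
    """Return the MD5 hex digest of some content, as HashContent (MD5) would."""
    return hashlib.md5(content).hexdigest()

# The MD5 of empty content is a well-known constant
print(md5_of(b""))  # d41d8cd98f00b204e9800998ecf8427e
```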
