Originally, I set the default in the properties file so that site-to-site is configured to be secure but not enabled. I did this because I was afraid that enabling it as non-secure by default would be dangerous, so I required that users explicitly go in and set it up. But thinking back on this, maybe that was a mistake. We set the default UI port to 8080 and non-secure, so maybe we should set the default so that site-to-site is enabled and non-secure as well. That would probably make this a lot easier.
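A minimal sketch of what the two alternatives being weighed here would look like in nifi.properties (the property names are the ones discussed later in this thread; the port value is purely illustrative):

```
# Current default: site-to-site secure but not enabled
nifi.remote.input.port=
nifi.remote.input.secure=true

# Proposed default: enabled and non-secure, matching the 8080 non-secure UI default
# (illustrative port value)
nifi.remote.input.port=10000
nifi.remote.input.secure=false
```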
From: Ricky Saltzer
Sent: Sunday, February 8, 2015 1:16 PM
To: [email protected]

Thanks for the tip, Mark! Allowing the user to enable the site-to-site feature at runtime would be a good step in the right direction. Documentation on how it works, and on why it's different from having your nodes in a cluster, would also make things easier to understand.

On Sun, Feb 8, 2015 at 9:11 AM, Joe Witt <[email protected]> wrote:
> Site-to-site is a powerhouse feature but has caused a good bit of
> confusion. Perhaps we should plan for its inclusion in the things that
> can be tuned/set at runtime.
>
> It would be good to include with that information about bounded
> interfaces: what messages will get sent, etc. Folks in proxy-type
> situations have a hard time reasoning about what is happening. There is
> little sense of "is this thing on?"
>
> What do you all think?
>
> On Feb 8, 2015 6:51 AM, "Mark Payne" <[email protected]> wrote:
>> Ricky,
>>
>> In the nifi.properties file, there's a property named
>> "nifi.remote.input.port". By default it's empty. Set that to whatever
>> port you want to use for site-to-site. Additionally, you'll need to
>> either set "nifi.remote.input.secure" to false or configure keystore
>> and truststore properties. Configure this for the nodes and the NCM.
>> After that you should be good to go (after a restart)!
>>
>> If you run into any issues, let us know.
>>
>> Thanks
>> -Mark
>>
>> Sent from my iPhone
>>
>>> On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[email protected]> wrote:
>>>
>>> Hey Joe -
>>>
>>> This makes sense, and I'm in the process of trying it out now. I'm
>>> running into a small issue where the remote process group is saying
>>> neither of the nodes is configured for site-to-site communication.
>>>
>>> Although not super intuitive, sending to the remote process group
>>> pointing to the cluster should be fine as long as it works (which I'm
>>> sure it does).
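Mark's steps above amount to an edit like the following in nifi.properties on each node and on the NCM. This is a sketch, not a complete reference: the nifi.security.* names are the standard keystore/truststore properties in nifi.properties, but the paths, passwords, and port shown are illustrative.

```
# Enable site-to-site on a chosen port (illustrative value)
nifi.remote.input.port=8081

# Option A: non-secure site-to-site
nifi.remote.input.secure=false

# Option B: secure site-to-site -- leave secure=true and configure TLS
# (illustrative paths/passwords)
# nifi.remote.input.secure=true
# nifi.security.keystore=/opt/nifi/conf/keystore.jks
# nifi.security.keystorePasswd=changeit
# nifi.security.truststore=/opt/nifi/conf/truststore.jks
# nifi.security.truststorePasswd=changeit
```

A restart is required after changing these properties, per Mark's note above.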
>>>
>>> Ricky
>>>
>>>> On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[email protected]> wrote:
>>>>
>>>> Ricky,
>>>>
>>>> So the use case you're coming from here is a good and common one,
>>>> which is: if I have a data source which does not offer scalability
>>>> (it can only send to a single node, for instance) but I have a
>>>> scalable distribution cluster, what are my options?
>>>>
>>>> So today you can accept the data on a single node, then immediately
>>>> do as Mark describes and fire it to a "Remote Process Group"
>>>> addressing the cluster itself. That way NiFi will automatically
>>>> figure out all the nodes in the cluster and spread the data around,
>>>> factoring in load, etc. But we do want to establish an even more
>>>> automatic mechanism on a connection itself, where the user can
>>>> indicate the data should be auto-balanced.
>>>>
>>>> The reverse is really true as well, where you can have a consumer
>>>> which only wants to accept from a single host. So there, too, we
>>>> need a mechanism to descale the approach.
>>>>
>>>> I realize the flow you're working with now is just a sort of
>>>> familiarization thing. But do you think this is something we should
>>>> tackle soon (based on real scenarios you face)?
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>>> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:
>>>>> Ricky,
>>>>>
>>>>> I don't think there's a JIRA ticket currently. Feel free to create
>>>>> one.
>>>>>
>>>>> I think we may need to do a better job documenting how Remote
>>>>> Process Groups work. If you have a cluster set up, you would add a
>>>>> Remote Process Group that points to the Cluster Manager (i.e., the
>>>>> URL that you connect to in order to see the graph).
>>>>>
>>>>> Then, anything that you send to the Remote Process Group will
>>>>> automatically get load-balanced across all of the nodes in the
>>>>> cluster. So you could set up a flow that looks something like:
>>>>>
>>>>> GenerateFlowFile -> RemoteProcessGroup
>>>>>
>>>>> Input Port -> HashContent
>>>>>
>>>>> So these 2 flows are disjoint. The first part generates data and
>>>>> then distributes it to the cluster (when you connect to the Remote
>>>>> Process Group, you choose which Input Port to send to).
>>>>>
>>>>> But what we'd like to do in the future is something like:
>>>>>
>>>>> GenerateFlowFile -> HashContent
>>>>>
>>>>> And then, on the connection in the middle, choose to
>>>>> auto-distribute the data. Right now you have to put the Remote
>>>>> Process Group in there to distribute to the cluster, and add the
>>>>> Input Port to receive the data. But there should only be a single
>>>>> RemoteProcessGroup that points to the entire cluster, not one per
>>>>> node.
>>>>>
>>>>> Thanks
>>>>>
>>>>> -Mark
>>>>>
>>>>> From: Ricky Saltzer
>>>>> Sent: Friday, February 6, 2015 3:06 PM
>>>>> To: [email protected]
>>>>>
>>>>> Mark -
>>>>>
>>>>> Thanks for the fast reply, much appreciated. This is what I
>>>>> figured, but since I was already in clustered mode, I wanted to
>>>>> make sure there wasn't an easier way than adding each node as a
>>>>> remote process group.
>>>>>
>>>>> Is there already a JIRA to track the ability to auto-distribute in
>>>>> clustered mode, or would you like me to open it up?
>>>>>
>>>>> Thanks again,
>>>>> Ricky
>>>>>
>>>>>> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:
>>>>>>
>>>>>> Ricky,
>>>>>>
>>>>>> The DistributeLoad processor is simply used to route to one of
>>>>>> many relationships. So if you have, for instance, 5 different
>>>>>> servers that you can FTP files to, you can use DistributeLoad to
>>>>>> round-robin the files between them, so that you end up pushing 20%
>>>>>> to each of 5 PutFTP processors.
>>>>>>
>>>>>> What you're wanting to do, it sounds like, is to distribute the
>>>>>> FlowFiles to different nodes in the cluster. The Remote Process
>>>>>> Group is how you would need to do that at this time. We have
>>>>>> discussed having the ability to mark a Connection as
>>>>>> "Auto-Distributed" (or maybe some better name 😊) and have that
>>>>>> automatically distribute the data between nodes in the cluster,
>>>>>> but that feature hasn't yet been implemented.
>>>>>>
>>>>>> Does that answer your question?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -Mark
>>>>>>
>>>>>> From: Ricky Saltzer
>>>>>> Sent: Friday, February 6, 2015 2:56 PM
>>>>>> To: [email protected]
>>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> I have a question regarding load distribution in a clustered NiFi
>>>>>> environment. I have a really simple example: I'm using the
>>>>>> GenerateFlowFile processor to generate some random data, then I
>>>>>> MD5 hash the file and print out the resulting hash.
>>>>>>
>>>>>> I want only the primary node to generate the data, but I want both
>>>>>> nodes in the cluster to share the hashing workload.
>>>>>> It appears that if I set the scheduling strategy to "On primary
>>>>>> node" for the GenerateFlowFile processor, then the next processor
>>>>>> (HashContent) is only accepting and processing data on a single
>>>>>> node.
>>>>>>
>>>>>> I've put a DistributeLoad processor in between GenerateFlowFile
>>>>>> and HashContent, but this requires me to use the remote process
>>>>>> group to distribute the load, which doesn't seem intuitive when
>>>>>> I'm already clustered.
>>>>>>
>>>>>> I guess my question is: is it possible for the DistributeLoad
>>>>>> processor to understand that NiFi is in a clustered environment,
>>>>>> and have the ability to distribute work to the next processor
>>>>>> (HashContent) amongst all nodes in the cluster?
>>>>>>
>>>>>> Cheers,
>>>>>> --
>>>>>> Ricky Saltzer
>>>>>> http://www.cloudera.com
>>>>>
>>>>> --
>>>>> Ricky Saltzer
>>>>> http://www.cloudera.com
>>>
>>> --
>>> Ricky Saltzer
>>> http://www.cloudera.com

--
Ricky Saltzer
http://www.cloudera.com
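The round-robin policy Mark describes for DistributeLoad (5 targets each receiving 20% of the files) can be sketched outside NiFi. This is only an illustration of the distribution behavior; the real processor runs inside NiFi and routes FlowFiles to relationships, and the target names here are hypothetical.

```python
from itertools import cycle

def round_robin(flowfiles, targets):
    """Assign each incoming FlowFile to the next target in turn."""
    assignment = {t: [] for t in targets}
    next_target = cycle(targets)
    for ff in flowfiles:
        assignment[next(next_target)].append(ff)
    return assignment

# 100 files spread across 5 hypothetical PutFTP targets: 20 files (20%) each
result = round_robin(range(100), ["ftp1", "ftp2", "ftp3", "ftp4", "ftp5"])
```

The same policy is what an "Auto-Distributed" connection would presumably apply across cluster nodes instead of relationships, which is why the thread treats the two as closely related.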
