Re: Load distribution in cluster mode

Ricky Saltzer Sun, 08 Feb 2015 10:44:12 -0800

Agreed, making the site to site feature as easy to configure as a regular
processor would eliminate a lot of the confusion. Depending on how long it
would take to get this in as a first-class citizen, it might be worthwhile
writing up a "How to enable site to site" page that the remote process
group can link to. As Mark described, this was easy, and things started to
work as soon as I restarted the services.
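
For reference, the change Mark describes further down the thread amounts to
roughly the following in nifi.properties, on each node and on the NCM. This is
only a sketch: the two property names come from his message, the port value is
an arbitrary placeholder chosen for illustration, and turning security off is
just one of the two options he mentions (the other is to configure the
keystore and truststore properties instead).

```properties
# Sketch only: 10000 is a hypothetical placeholder, not a NiFi default.
nifi.remote.input.port=10000
# Non-secure site-to-site; the alternative is to keep this secure and
# configure the keystore and truststore properties.
nifi.remote.input.secure=false
```

A restart of each instance is needed before the remote process group sees the
change.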


Ricky

On Sun, Feb 8, 2015 at 1:24 PM, Joe Witt <[email protected]> wrote:

> I think the change you made more recently was totally appropriate.
> The right answer here in my opinion is to provide a way for users to
> manage/view/understand this at runtime.
>
> Site to site is a pretty great feature and we just need to give it a
> more first-class treatment:
> - Great documentation for users in the app
> - Great documentation for the protocol itself and examples of clients
> (we should likely even help seed the development of a few for popular
> languages)
> - Good user experience at runtime to modify and understand what is
> happening, etc.
>
> Also sorry for my unbelievably unreadable e-mail on this thread
> earlier. I really should never send e-mails from my phone.
>
> Thanks
> Joe
>
> On Sun, Feb 8, 2015 at 1:17 PM, Mark Payne <[email protected]> wrote:
> > Originally, I set the default in the properties file so that
> > site-to-site is configured to be secure but not enabled. I did this
> > because I didn’t want to enable it as non-secure by default because I was
> > afraid that this would be dangerous… so I required that users explicitly
> > go in and set it up. But thinking back to this, maybe that was a mistake.
> > We set the default UI port to be 8080 and non-secure, so maybe we should
> > just set the default so that site-to-site is enabled non-secure, as well.
> > That would probably just make this a lot easier.
> >
> > From: Ricky Saltzer
> > Sent: Sunday, February 8, 2015 1:16 PM
> > To: [email protected]
> >
> > Thanks for the tip, Mark! Allowing the user to enable the site to site
> > feature during runtime would be a good step in the right direction.
> > Documentation on how it works and why it's different from having your
> > nodes in a cluster would also make things easier to understand.
> >
> >
> > On Sun, Feb 8, 2015 at 9:11 AM, Joe Witt <[email protected]> wrote:
> >
> >> Site to site is a powerhouse feature but has caused a good bit of
> >> confusion.  Perhaps we should plan for its inclusion in the things that
> >> can be tuned/set at runtime.
> >>
> >> It would be good to include with that information about bounded
> >> interfaces, information about what messages will get sent, etc.  Folks
> >> in proxy type situations have a hard time reasoning over what is
> >> happening.  There is little sense of "is this thing on".
> >>
> >> What do you all think?
> >> On Feb 8, 2015 6:51 AM, "Mark Payne" <[email protected]> wrote:
> >>
> >> > Ricky,
> >> >
> >> > In the nifi.properties file, there's a property named
> >> > "nifi.remote.input.port". By default it's empty. Set that to whatever
> >> > port you want to use for site-to-site. Additionally, you'll need to
> >> > either set "nifi.remote.input.secure" to false or configure keystore
> >> > and truststore properties. Configure this for nodes and NCM.  After
> >> > that you should be good to go (after restart)!
> >> >
> >> > If you run into any issues let us know.
> >> >
> >> > Thanks
> >> > -Mark
> >> >
> >> > Sent from my iPhone
> >> >
> >> > > On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[email protected]> wrote:
> >> > >
> >> > > Hey Joe -
> >> > >
> >> > > This makes sense, and I'm in the process of trying it out now. I'm
> >> > > running into a small issue where the remote process group is saying
> >> > > neither of the nodes is configured for Site-to-Site communication.
> >> > >
> >> > > Although not super intuitive, sending to the remote process group
> >> > > pointing to the cluster should be fine as long as it works (which I'm
> >> > > sure it does).
> >> > >
> >> > > Ricky
> >> > >
> >> > >> On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[email protected]> wrote:
> >> > >>
> >> > >> Ricky,
> >> > >>
> >> > >> So the use case you're coming from here is a good and common one,
> >> > >> which is:
> >> > >>
> >> > >> If I have a datasource which does not offer scalability (it can only
> >> > >> send to a single node, for instance) but I have a scalable
> >> > >> distribution cluster, what are my options?
> >> > >>
> >> > >> So today you can accept the data on a single node, then immediately
> >> > >> do as Mark describes and fire it to a "Remote Process Group"
> >> > >> addressing the cluster itself.  That way NiFi will automatically
> >> > >> figure out all the nodes in the cluster and spread the data around,
> >> > >> factoring in load, etc.  But we do want to establish an even more
> >> > >> automatic mechanism on a connection itself where the user can
> >> > >> indicate the data should be auto-balanced.
> >> > >>
> >> > >> The reverse is really true as well, where you can have a consumer
> >> > >> which only wants to accept from a single host.  So there too we need
> >> > >> a mechanism to descale the approach.
> >> > >>
> >> > >> I realize the flow you're working with now is just a sort of
> >> > >> familiarization thing.  But do you think this is something we
> >> > >> should tackle soon (based on real scenarios you face)?
> >> > >>
> >> > >> Thanks
> >> > >> Joe
> >> > >>
> >> > >>> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:
> >> > >>> Ricky,
> >> > >>>
> >> > >>> I don’t think there’s a JIRA ticket currently. Feel free to create
> >> > >>> one.
> >> > >>>
> >> > >>> I think we may need to do a better job documenting how the Remote
> >> > >>> Process Groups work. If you have a cluster set up, you would add a
> >> > >>> Remote Process Group that points to the Cluster Manager (i.e., the
> >> > >>> URL that you connect to in order to see the graph).
> >> > >>>
> >> > >>> Then, anything that you send to the Remote Process Group will
> >> > >>> automatically get load-balanced across all of the nodes in the
> >> > >>> cluster. So you could set up a flow that looks something like:
> >> > >>>
> >> > >>> GenerateFlowFile -> RemoteProcessGroup
> >> > >>>
> >> > >>> Input Port -> HashContent
> >> > >>>
> >> > >>> So these 2 flows are disjoint. The first part generates data and
> >> > >>> then distributes it to the cluster (when you connect to the Remote
> >> > >>> Process Group, you choose which Input Port to send to).
> >> > >>>
> >> > >>> But what we’d like to do in the future is something like:
> >> > >>>
> >> > >>> GenerateFlowFile -> HashContent
> >> > >>>
> >> > >>> And then on the connection in the middle choose to auto-distribute
> >> > >>> the data. Right now you have to put the Remote Process Group in
> >> > >>> there to distribute to the cluster, and add the Input Port to
> >> > >>> receive the data. But there should only be a single
> >> > >>> RemoteProcessGroup that points to the entire cluster, not one per
> >> > >>> node.
> >> > >>>
> >> > >>> Thanks
> >> > >>> -Mark
> >> > >>>
> >> > >>> From: Ricky Saltzer
> >> > >>> Sent: Friday, February 6, 2015 3:06 PM
> >> > >>> To: [email protected]
> >> > >>>
> >> > >>> Mark -
> >> > >>>
> >> > >>> Thanks for the fast reply, much appreciated. This is what I
> >> > >>> figured, but since I was already in clustered mode, I wanted to make
> >> > >>> sure there wasn't an easier way than adding each node as a remote
> >> > >>> process group.
> >> > >>>
> >> > >>> Is there already a JIRA to track the ability to auto distribute in
> >> > >>> clustered mode, or would you like me to open it up?
> >> > >>>
> >> > >>> Thanks again,
> >> > >>> Ricky
> >> > >>>
> >> > >>>> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:
> >> > >>>>
> >> > >>>> Ricky,
> >> > >>>>
> >> > >>>> The DistributeLoad processor is simply used to route to one of
> >> > >>>> many relationships. So if you have, for instance, 5 different
> >> > >>>> servers that you can FTP files to, you can use DistributeLoad to
> >> > >>>> round robin the files between them, so that you end up pushing 20%
> >> > >>>> to each of 5 PutFTP processors.
> >> > >>>>
> >> > >>>> What you’re wanting to do, it sounds like, is to distribute the
> >> > >>>> FlowFiles to different nodes in the cluster. The Remote Process
> >> > >>>> Group is how you would need to do that at this time. We have
> >> > >>>> discussed having the ability to mark a Connection as
> >> > >>>> “Auto-Distributed” (or maybe some better name 😊) and have that
> >> > >>>> automatically distribute the data between nodes in the cluster,
> >> > >>>> but that feature hasn’t yet been implemented.
> >> > >>>>
> >> > >>>> Does that answer your question?
> >> > >>>>
> >> > >>>> Thanks
> >> > >>>> -Mark
> >> > >>>>
> >> > >>>> From: Ricky Saltzer
> >> > >>>> Sent: Friday, February 6, 2015 2:56 PM
> >> > >>>> To: [email protected]
> >> > >>>>
> >> > >>>> Hi -
> >> > >>>>
> >> > >>>> I have a question regarding load distribution in a clustered NiFi
> >> > >>>> environment. I have a really simple example: I'm using the
> >> > >>>> GenerateFlowFile processor to generate some random data, then I
> >> > >>>> MD5 hash the file and print out the resulting hash.
> >> > >>>>
> >> > >>>> I want only the primary node to generate the data, but I want both
> >> > >>>> nodes in the cluster to share the hashing workload. It appears if I
> >> > >>>> set the scheduling strategy to "On primary node" for the
> >> > >>>> GenerateFlowFile processor, then the next processor (HashContent)
> >> > >>>> is only being accepted and processed by a single node.
> >> > >>>>
> >> > >>>> I've put a DistributeLoad processor in between the HashContent and
> >> > >>>> GenerateFlowFile, but this requires me to use the remote process
> >> > >>>> group to distribute the load, which doesn't seem intuitive when I'm
> >> > >>>> already clustered.
> >> > >>>>
> >> > >>>> I guess my question is, is it possible for the DistributeLoad
> >> > >>>> processor to understand that NiFi is in a clustered environment,
> >> > >>>> and have an ability to distribute the next processor (HashContent)
> >> > >>>> amongst all nodes in the cluster?
> >> > >>>>
> >> > >>>> Cheers,
> >> > >>>> --
> >> > >>>> Ricky Saltzer
> >> > >>>> http://www.cloudera.com



-- 
Ricky Saltzer
http://www.cloudera.com
