Site-to-site is a powerhouse feature but has caused a good bit of confusion. Perhaps we should plan for its inclusion in the set of things that can be tuned/set at runtime.
It would be good to include with that information about bound interfaces, what messages will get sent, etc. Folks in proxy-type situations have a hard time reasoning about what is happening; there is little sense of "is this thing on". What do you all think?

On Feb 8, 2015 6:51 AM, "Mark Payne" <[email protected]> wrote:

> Ricky,
>
> In the nifi.properties file, there's a property named "nifi.remote.input.port". By default it's empty. Set that to whatever port you want to use for site-to-site. Additionally, you'll need to either set "nifi.remote.input.secure" to false or configure the keystore and truststore properties. Configure this for the nodes and the NCM. After that you should be good to go (after a restart)!
>
> If you run into any issues, let us know.
>
> Thanks
> -Mark
>
> Sent from my iPhone

On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[email protected]> wrote:

> Hey Joe -
>
> This makes sense, and I'm in the process of trying it out now. I'm running into a small issue where the remote process group is saying neither of the nodes is configured for site-to-site communication.
>
> Although not super intuitive, sending to the remote process group pointing to the cluster should be fine as long as it works (which I'm sure it does).
>
> Ricky

On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[email protected]> wrote:

> Ricky,
>
> So the use case you're coming from here is a good and common one, which is: if I have a data source which does not offer scalability (it can only send to a single node, for instance) but I have a scalable distribution cluster, what are my options?
>
> So today you can accept the data on a single node, then immediately do as Mark describes and fire it to a "Remote Process Group" addressing the cluster itself.
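Mark's nifi.properties steps can be sketched as a fragment like the one below. The port number is only an example, and the commented keystore/truststore property names are illustrative assumptions — check them against the comments in your own nifi.properties:

```properties
# Unsecured site-to-site (example port; set on every node and on the NCM, then restart)
nifi.remote.input.port=10000
nifi.remote.input.secure=false

# Or leave nifi.remote.input.secure=true and configure the keystore/truststore
# properties instead (names below are illustrative assumptions):
# nifi.security.keystore=/opt/nifi/conf/keystore.jks
# nifi.security.truststore=/opt/nifi/conf/truststore.jks
```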
> That way NiFi will automatically figure out all the nodes in the cluster and spread the data around, factoring in load, etc. But we do want to establish an even more automatic mechanism on a connection itself, where the user can indicate that the data should be auto-balanced.
>
> The reverse is really true as well, where you can have a consumer which only wants to accept from a single host. So there, too, we need a mechanism to descale the approach.
>
> I realize the flow you're working with now is just a sort of familiarization thing. But do you think this is something we should tackle soon (based on real scenarios you face)?
>
> Thanks
> Joe

On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:

> Ricky,
>
> I don't think there's a JIRA ticket currently. Feel free to create one.
>
> I think we may need to do a better job documenting how the Remote Process Groups work. If you have a cluster setup, you would add a Remote Process Group that points to the Cluster Manager (i.e., the URL that you connect to in order to see the graph).
>
> Then, anything that you send to the Remote Process Group will automatically get load-balanced across all of the nodes in the cluster. So you could set up a flow that looks something like:
>
> GenerateFlowFile -> RemoteProcessGroup
>
> Input Port -> HashContent
>
> So these 2 flows are disjoint. The first part generates data and then distributes it to the cluster (when you connect to the Remote Process Group, you choose which Input Port to send to).
>
> But what we'd like to do in the future is something like:
>
> GenerateFlowFile -> HashContent
>
> And then, on the connection in the middle, choose to auto-distribute the data.
> Right now you have to put the Remote Process Group in there to distribute to the cluster, and add the Input Port to receive the data. But there should be only a single RemoteProcessGroup that points to the entire cluster, not one per node.
>
> Thanks
> -Mark

From: Ricky Saltzer
Sent: Friday, February 6, 2015 3:06 PM
To: [email protected]

> Mark -
>
> Thanks for the fast reply, much appreciated. This is what I figured, but since I was already in clustered mode, I wanted to make sure there wasn't an easier way than adding each node as a remote process group.
>
> Is there already a JIRA to track the ability to auto-distribute in clustered mode, or would you like me to open it up?
>
> Thanks again,
> Ricky

On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:

> Ricky,
>
> The DistributeLoad processor is simply used to route to one of many relationships. So if you have, for instance, 5 different servers that you can FTP files to, you can use DistributeLoad to round-robin the files between them, so that you end up pushing 20% to each of 5 PutFTP processors.
>
> What you're wanting to do, it sounds like, is to distribute the FlowFiles to different nodes in the cluster. The Remote Process Group is how you would need to do that at this time. We have discussed having the ability to mark a Connection as "Auto-Distributed" (or maybe some better name 😊) and have that automatically distribute the data between nodes in the cluster, but that feature hasn't yet been implemented.
>
> Does that answer your question?
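Conceptually, the round-robin behavior Mark describes for DistributeLoad can be sketched in plain Python (this is not the NiFi API; the relationship names and function are made up for illustration):

```python
from itertools import cycle

# Five downstream relationships, as in the 5-PutFTP example (names illustrative)
relationships = ["putftp-1", "putftp-2", "putftp-3", "putftp-4", "putftp-5"]

def distribute(flowfiles, relationships):
    """Send each incoming FlowFile to the next relationship in round-robin order."""
    assignments = {name: [] for name in relationships}
    chooser = cycle(relationships)
    for flowfile in flowfiles:
        assignments[next(chooser)].append(flowfile)
    return assignments

# 100 FlowFiles split evenly: 20 files (20%) to each of the 5 relationships
result = distribute(range(100), relationships)
```

This illustrates why DistributeLoad balances load across *relationships* on one node but knows nothing about other nodes in the cluster.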
> Thanks
> -Mark

From: Ricky Saltzer
Sent: Friday, February 6, 2015 2:56 PM
To: [email protected]

> Hi -
>
> I have a question regarding load distribution in a clustered NiFi environment. I have a really simple example: I'm using the GenerateFlowFile processor to generate some random data, then I MD5-hash the file and print out the resulting hash.
>
> I want only the primary node to generate the data, but I want both nodes in the cluster to share the hashing workload. It appears that if I set the scheduling strategy to "On primary node" for the GenerateFlowFile processor, then the next processor (HashContent) only runs on a single node.
>
> I've put a DistributeLoad processor in between the HashContent and GenerateFlowFile, but this requires me to use the remote process group to distribute the load, which doesn't seem intuitive when I'm already clustered.
>
> I guess my question is: is it possible for the DistributeLoad processor to understand that NiFi is in a clustered environment, and have an ability to distribute the next processor (HashContent) amongst all nodes in the cluster?
>
> Cheers,
> --
> Ricky Saltzer
> http://www.cloudera.com
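For reference, the hashing step in Ricky's flow (HashContent configured for MD5) boils down to computing a digest of the FlowFile's content; outside NiFi that is just:

```python
import hashlib

def md5_of(content: bytes) -> str:
    """Return the MD5 hex digest of some content, as HashContent (MD5) would."""
    return hashlib.md5(content).hexdigest()

# The MD5 of empty content is a well-known constant
print(md5_of(b""))  # d41d8cd98f00b204e9800998ecf8427e
```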
