Originally, I set the default in the properties file so that site-to-site is configured to be secure but not enabled. I did this because I was afraid that enabling it as non-secure by default would be dangerous, so I required that users explicitly go in and set it up. But thinking back on this, maybe that was a mistake. We set the default UI port to 8080 and non-secure, so maybe we should set the default so that site-to-site is enabled and non-secure as well. That would probably make this a lot easier.
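A minimal sketch of what the two alternatives being weighed here would look like in nifi.properties (the property names are the ones discussed later in this thread; the port value is purely illustrative):

```
# Current default: site-to-site secure but not enabled
nifi.remote.input.port=
nifi.remote.input.secure=true

# Proposed default: enabled and non-secure, matching the 8080 non-secure UI default
# (illustrative port value)
nifi.remote.input.port=10000
nifi.remote.input.secure=false
```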
From: Ricky Saltzer
Sent: Sunday, February 8, 2015 1:16 PM
To: [email protected]

Thanks for the tip, Mark! Allowing the user to enable the site-to-site feature at runtime would be a good step in the right direction. Documentation on how it works, and on why it's different from having your nodes in a cluster, would also make things easier to understand.

On Sun, Feb 8, 2015 at 9:11 AM, Joe Witt <[email protected]> wrote:
> Site-to-site is a powerhouse feature but has caused a good bit of
> confusion. Perhaps we should plan for its inclusion in the things that
> can be tuned/set at runtime.
>
> It would be good to include with that information about bounded
> interfaces: what messages will get sent, etc. Folks in proxy-type
> situations have a hard time reasoning about what is happening. There is
> little sense of "is this thing on?"
>
> What do you all think?
>
> On Feb 8, 2015 6:51 AM, "Mark Payne" <[email protected]> wrote:
>> Ricky,
>>
>> In the nifi.properties file, there's a property named
>> "nifi.remote.input.port". By default it's empty. Set that to whatever
>> port you want to use for site-to-site. Additionally, you'll need to
>> either set "nifi.remote.input.secure" to false or configure keystore
>> and truststore properties. Configure this for the nodes and the NCM.
>> After that you should be good to go (after a restart)!
>>
>> If you run into any issues, let us know.
>>
>> Thanks
>> -Mark
>>
>> Sent from my iPhone
>>
>>> On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[email protected]> wrote:
>>>
>>> Hey Joe -
>>>
>>> This makes sense, and I'm in the process of trying it out now. I'm
>>> running into a small issue where the remote process group is saying
>>> neither of the nodes is configured for site-to-site communication.
>>>
>>> Although not super intuitive, sending to the remote process group
>>> pointing to the cluster should be fine as long as it works (which I'm
>>> sure it does).
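Mark's steps above amount to an edit like the following in nifi.properties on each node and on the NCM. This is a sketch, not a complete reference: the nifi.security.* names are the standard keystore/truststore properties in nifi.properties, but the paths, passwords, and port shown are illustrative.

```
# Enable site-to-site on a chosen port (illustrative value)
nifi.remote.input.port=8081

# Option A: non-secure site-to-site
nifi.remote.input.secure=false

# Option B: secure site-to-site -- leave secure=true and configure TLS
# (illustrative paths/passwords)
# nifi.remote.input.secure=true
# nifi.security.keystore=/opt/nifi/conf/keystore.jks
# nifi.security.keystorePasswd=changeit
# nifi.security.truststore=/opt/nifi/conf/truststore.jks
# nifi.security.truststorePasswd=changeit
```

A restart is required after changing these properties, per Mark's note above.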
>>>
>>> Ricky
>>>
>>>> On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[email protected]> wrote:
>>>>
>>>> Ricky,
>>>>
>>>> So the use case you're coming from here is a good and common one,
>>>> which is: if I have a data source which does not offer scalability
>>>> (it can only send to a single node, for instance) but I have a
>>>> scalable distribution cluster, what are my options?
>>>>
>>>> So today you can accept the data on a single node, then immediately
>>>> do as Mark describes and fire it to a "Remote Process Group"
>>>> addressing the cluster itself. That way NiFi will automatically
>>>> figure out all the nodes in the cluster and spread the data around,
>>>> factoring in load, etc. But we do want to establish an even more
>>>> automatic mechanism on a connection itself, where the user can
>>>> indicate the data should be auto-balanced.
>>>>
>>>> The reverse is really true as well, where you can have a consumer
>>>> which only wants to accept from a single host. So there, too, we
>>>> need a mechanism to descale the approach.
>>>>
>>>> I realize the flow you're working with now is just a sort of
>>>> familiarization thing. But do you think this is something we should
>>>> tackle soon (based on real scenarios you face)?
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>>> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:
>>>>> Ricky,
>>>>>
>>>>> I don't think there's a JIRA ticket currently. Feel free to create
>>>>> one.
>>>>>
>>>>> I think we may need to do a better job documenting how Remote
>>>>> Process Groups work. If you have a cluster set up, you would add a
>>>>> Remote Process Group that points to the Cluster Manager (i.e., the
>>>>> URL that you connect to in order to see the graph).
>>>>>
>>>>> Then, anything that you send to the Remote Process Group will
>>>>> automatically get load-balanced across all of the nodes in the
>>>>> cluster. So you could set up a flow that looks something like:
>>>>>
>>>>> GenerateFlowFile -> RemoteProcessGroup
>>>>>
>>>>> Input Port -> HashContent
>>>>>
>>>>> So these 2 flows are disjoint. The first part generates data and
>>>>> then distributes it to the cluster (when you connect to the Remote
>>>>> Process Group, you choose which Input Port to send to).
>>>>>
>>>>> But what we'd like to do in the future is something like:
>>>>>
>>>>> GenerateFlowFile -> HashContent
>>>>>
>>>>> And then, on the connection in the middle, choose to
>>>>> auto-distribute the data. Right now you have to put the Remote
>>>>> Process Group in there to distribute to the cluster, and add the
>>>>> Input Port to receive the data. But there should only be a single
>>>>> RemoteProcessGroup that points to the entire cluster, not one per
>>>>> node.
>>>>>
>>>>> Thanks
>>>>>
>>>>> -Mark
>>>>>
>>>>> From: Ricky Saltzer
>>>>> Sent: Friday, February 6, 2015 3:06 PM
>>>>> To: [email protected]
>>>>>
>>>>> Mark -
>>>>>
>>>>> Thanks for the fast reply, much appreciated. This is what I
>>>>> figured, but since I was already in clustered mode, I wanted to
>>>>> make sure there wasn't an easier way than adding each node as a
>>>>> remote process group.
>>>>>
>>>>> Is there already a JIRA to track the ability to auto-distribute in
>>>>> clustered mode, or would you like me to open it up?
>>>>>
>>>>> Thanks again,
>>>>> Ricky
>>>>>
>>>>>> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:
>>>>>>
>>>>>> Ricky,
>>>>>>
>>>>>> The DistributeLoad processor is simply used to route to one of
>>>>>> many relationships. So if you have, for instance, 5 different
>>>>>> servers that you can FTP files to, you can use DistributeLoad to
>>>>>> round-robin the files between them, so that you end up pushing 20%
>>>>>> to each of 5 PutFTP processors.
>>>>>>
>>>>>> What you're wanting to do, it sounds like, is to distribute the
>>>>>> FlowFiles to different nodes in the cluster. The Remote Process
>>>>>> Group is how you would need to do that at this time. We have
>>>>>> discussed having the ability to mark a Connection as
>>>>>> "Auto-Distributed" (or maybe some better name 😊) and have that
>>>>>> automatically distribute the data between nodes in the cluster,
>>>>>> but that feature hasn't yet been implemented.
>>>>>>
>>>>>> Does that answer your question?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -Mark
>>>>>>
>>>>>> From: Ricky Saltzer
>>>>>> Sent: Friday, February 6, 2015 2:56 PM
>>>>>> To: [email protected]
>>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> I have a question regarding load distribution in a clustered NiFi
>>>>>> environment. I have a really simple example: I'm using the
>>>>>> GenerateFlowFile processor to generate some random data, then I
>>>>>> MD5 hash the file and print out the resulting hash.
>>>>>>
>>>>>> I want only the primary node to generate the data, but I want both
>>>>>> nodes in the cluster to share the hashing workload.
>>>>>> It appears that if I set the scheduling strategy to "On primary
>>>>>> node" for the GenerateFlowFile processor, then the next processor
>>>>>> (HashContent) is only accepting and processing data on a single
>>>>>> node.
>>>>>>
>>>>>> I've put a DistributeLoad processor in between GenerateFlowFile
>>>>>> and HashContent, but this requires me to use the remote process
>>>>>> group to distribute the load, which doesn't seem intuitive when
>>>>>> I'm already clustered.
>>>>>>
>>>>>> I guess my question is: is it possible for the DistributeLoad
>>>>>> processor to understand that NiFi is in a clustered environment,
>>>>>> and have the ability to distribute work to the next processor
>>>>>> (HashContent) amongst all nodes in the cluster?
>>>>>>
>>>>>> Cheers,
>>>>>> --
>>>>>> Ricky Saltzer
>>>>>> http://www.cloudera.com
>>>>>
>>>>> --
>>>>> Ricky Saltzer
>>>>> http://www.cloudera.com
>>>
>>> --
>>> Ricky Saltzer
>>> http://www.cloudera.com

--
Ricky Saltzer
http://www.cloudera.com
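The round-robin policy Mark describes for DistributeLoad (5 targets each receiving 20% of the files) can be sketched outside NiFi. This is only an illustration of the distribution behavior; the real processor runs inside NiFi and routes FlowFiles to relationships, and the target names here are hypothetical.

```python
from itertools import cycle

def round_robin(flowfiles, targets):
    """Assign each incoming FlowFile to the next target in turn."""
    assignment = {t: [] for t in targets}
    next_target = cycle(targets)
    for ff in flowfiles:
        assignment[next(next_target)].append(ff)
    return assignment

# 100 files spread across 5 hypothetical PutFTP targets: 20 files (20%) each
result = round_robin(range(100), ["ftp1", "ftp2", "ftp3", "ftp4", "ftp5"])
```

The same policy is what an "Auto-Distributed" connection would presumably apply across cluster nodes instead of relationships, which is why the thread treats the two as closely related.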
