Re: Load distribution in cluster mode

Mark Payne Fri, 06 Feb 2015 12:16:59 -0800

Ricky,

I don’t think there’s a JIRA ticket currently. Feel free to create one.

I think we may need to do a better job documenting how Remote Process 
Groups work. If you have a cluster set up, you would add a Remote Process 
Group that points to the Cluster Manager (i.e., the URL that you connect to 
in order to see the graph).

Then, anything that you send to the Remote Process Group will automatically get 
load-balanced across all of the nodes in the cluster. So you could set up a flow 
that looks something like:

GenerateFlowFile -> RemoteProcessGroup

Input Port -> HashContent

So these 2 flows are disjoint. The first part generates data and then 
distributes it to the cluster (when you connect to the Remote Process Group, 
you choose which Input Port to send to).
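
As a rough illustration of the effect described above, here is a toy model in
Python. It is not NiFi code and does not show how the Remote Process Group is
implemented. The node names, the FlowFile names, and the plain round-robin
policy are all assumptions made for the sketch; the real cluster may pick
nodes differently.

```python
from collections import Counter
from itertools import cycle


def distribute(flowfiles, nodes):
    """Assign each FlowFile to a node in round-robin order (assumed policy)."""
    assignment = {}
    node_cycle = cycle(nodes)
    for flowfile in flowfiles:
        assignment[flowfile] = next(node_cycle)
    return assignment


# Hypothetical names: two cluster nodes and ten generated FlowFiles.
nodes = ["node-1", "node-2"]
flowfiles = [f"flowfile-{i}" for i in range(10)]

assignment = distribute(flowfiles, nodes)
per_node = Counter(assignment.values())
print(per_node)
```

In this model the generating side only ever refers to the cluster as a whole
(the `nodes` list), which mirrors pointing one Remote Process Group at the
Cluster Manager; each node would then receive its share on the Input Port and
run HashContent on it.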

But what we’d like to do in the future is something like:

GenerateFlowFile -> HashContent

And then on the connection in the middle choose to auto-distribute the data. 
Right now you have to put the Remote Process Group in there to distribute to 
the cluster, and add the Input Port to receive the data. But there should only 
be a single RemoteProcessGroup that points to the entire cluster, not one per 
node.

Thanks

-Mark

From: Ricky Saltzer
Sent: Friday, February 6, 2015 3:06 PM
To: [email protected]

Mark -

Thanks for the fast reply, much appreciated. This is what I figured, but
since I was already in clustered mode, I wanted to make sure there wasn't
an easier way than adding each node as a remote process group.

Is there already a JIRA to track the ability to auto distribute in
clustered mode, or would you like me to open it up?

Thanks again,
Ricky

On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:

> Ricky,
>
>
> The DistributeLoad processor is simply used to route to one of many
> relationships. So if you have, for instance, 5 different servers that you
> can FTP files to, you can use DistributeLoad to round robin the files
> between them, so that you end up pushing 20% to each of 5 PutFTP processors.
>
>
> What you’re wanting to do, it sounds like, is to distribute the FlowFiles
> to different nodes in the cluster. The Remote Process Group is how you
> would need to do that at this time. We have discussed having the ability to
> mark a Connection as “Auto-Distributed” (or maybe some better name 😊) and
> have that automatically distribute the data between nodes in the cluster,
> but that feature hasn’t yet been implemented.
>
>
> Does that answer your question?
>
>
> Thanks
>
> -Mark
>
> From: Ricky Saltzer
> Sent: Friday, February 6, 2015 2:56 PM
> To: [email protected]
>
> Hi -
>
> I have a question regarding load distribution in a clustered NiFi
> environment. I have a really simple example: I'm using the GenerateFlowFile
> processor to generate some random data, then I MD5 hash the file and print
> out the resulting hash.
>
> I want only the primary node to generate the data, but I want both nodes in
> the cluster to share the hashing workload. It appears that if I set the
> scheduling strategy to "On primary node" for the GenerateFlowFile
> processor, then the FlowFiles sent to the next processor (HashContent) are
> only accepted and processed by a single node.
>
> I've put a DistributeLoad processor between the GenerateFlowFile and
> HashContent processors, but this requires me to use the remote process
> group to distribute the load, which doesn't seem intuitive when I'm
> already clustered.
>
> I guess my question is, is it possible for the DistributeLoad processor to
> understand that NiFi is in a clustered environment, and have an ability to
> distribute the next processor (HashContent) amongst all nodes in the
> cluster?
>
> Cheers,
> --
> Ricky Saltzer
> http://www.cloudera.com
>

-- 
Ricky Saltzer
http://www.cloudera.com
