Re: Batch Processors

Joe Witt Tue, 24 Feb 2015 06:44:16 -0800

Mike,

This is extremely common.  Both sides of this are.  You have some
low-latency or batch producer and you want to deliver to some
low-latency or batch receiver.

This is what splitting is for (in the case of going large to small) or
joining is for (in the case of 'batching').  MergeContent is designed
for the batching/aggregation case.  It allows you to merge using a
couple of strategies, with binary concatenation being the most common.

The classic example is receiving a live stream of data which
needs to be sent to HDFS.  We'd set up MergeContent to aggregate data
to a size that is close to or matches the desired HDFS block size.
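
To make the aggregation idea concrete, here is a minimal sketch in
Python of size-based binary concatenation.  It is only an illustration
of the concept, not how MergeContent is implemented; the function name
merge_by_size and the target_size parameter are made up for this
example.

```python
from typing import Iterable, Iterator


def merge_by_size(chunks: Iterable[bytes], target_size: int) -> Iterator[bytes]:
    """Concatenate small chunks into bundles of at least target_size bytes.

    Hypothetical helper for illustration; not part of NiFi.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit a bundle as soon as the buffer reaches the target size.
        if len(buffer) >= target_size:
            yield bytes(buffer)
            buffer.clear()
    # Flush whatever is left so no trailing data is dropped.
    if buffer:
        yield bytes(buffer)
```

A real flow would also want a maximum wait time so that a slow stream
does not hold a partial bundle forever; that is left out here.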

Now the interesting part you mention is what happens if 'object'
45 of 100 causes a problem with the downstream system.  How
would/could NiFi know about that object?  Is it not feasible to
evaluate the data for its fitness to merge prior to doing so?
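
As a sketch of what "evaluating fitness prior to merging" could look
like, the Python below partitions records with a caller-supplied check
before anything is batched.  The names partition_by_fitness and
is_valid_json are hypothetical, and "parses as JSON" is just an assumed
stand-in for whatever the downstream system actually requires.

```python
import json
from typing import Callable, Iterable, List, Tuple


def partition_by_fitness(
    records: Iterable[bytes], is_fit: Callable[[bytes], bool]
) -> Tuple[List[bytes], List[bytes]]:
    """Split records into (fit, unfit) lists, preserving input order."""
    fit: List[bytes] = []
    unfit: List[bytes] = []
    for record in records:
        (fit if is_fit(record) else unfit).append(record)
    return fit, unfit


def is_valid_json(record: bytes) -> bool:
    """Example check (an assumption): the record must parse as JSON."""
    try:
        json.loads(record)
        return True
    except ValueError:  # covers JSON syntax errors and bad UTF-8
        return False
```

Only the "fit" list would then go on to be merged; the "unfit" records
can be routed elsewhere for inspection.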

Anyway - let us know what you're thinking in terms of how NiFi would
know which object was problematic or that any were problematic for
that matter.

Thanks
Joe

On Tue, Feb 24, 2015 at 9:28 AM, Mike Drob <[email protected]> wrote:
> NiFi experts,
>
> Let's say that I want to send data from NiFi to some destination that works
> much better when the documents are batched. I do not think this is an
> unreasonable ask.
>
> I imagine that I would want to first combine all of the records in one
> processor, and then pass on to a dedicated processor for sending the data?
> I'm not sure yet if I would be able to use existing processors for this, or
> if I could create my own, but this part feels fairly straightforward.
>
> Next, let's imagine that some document in the batch causes it to fail. I
> would like to un-batch, and create smaller batches, and try to send those,
> assuming that some piece of the data was malformed and not a transient
> error like network unavailable. Is this pattern workable? I can imagine
> several layers of fail/split/retry to winnow from 1000 documents to 100 to
> 10 to 1, so that I can still get most of my data sent and know exactly
> which documents fail.
>
> I'm largely thinking out loud here, somebody stop me if I'm off the deep
> end, or if this has been done before and we have examples (I didn't see any
> readily apparent).
>
> Mike
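
For reference, the fail/split/retry idea described in the quoted
message can be sketched as below.  This is only a sketch under stated
assumptions: it halves the batch on each failure rather than going
1000 to 100 to 10 to 1, it assumes the (hypothetical) send callable
raises an exception when the destination rejects a batch, and it uses
a made-up TransientError type to mark failures, such as the network
being unavailable, that should not trigger splitting.  None of these
names come from NiFi.

```python
from typing import Callable, List, Sequence, Tuple


class TransientError(Exception):
    """Hypothetical marker for failures that are not caused by the data."""


def send_with_bisect(
    docs: Sequence[str], send: Callable[[Sequence[str]], None]
) -> Tuple[List[str], List[str]]:
    """Try to send docs as one batch; on rejection, split in half and retry.

    Returns (sent, rejected), both in original order.
    """
    if not docs:
        return [], []
    try:
        send(docs)
        return list(docs), []
    except TransientError:
        # Not a data problem: let the caller retry the whole batch later.
        raise
    except Exception:
        if len(docs) == 1:
            # A single document that still fails is the malformed one.
            return [], list(docs)
        mid = len(docs) // 2
        sent_left, bad_left = send_with_bisect(docs[:mid], send)
        sent_right, bad_right = send_with_bisect(docs[mid:], send)
        return sent_left + sent_right, bad_left + bad_right
```

With one bad document in a batch of n, halving costs on the order of
log2(n) extra send attempts per level of the split, and the good
documents are still delivered in batches rather than one at a time.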
