On Tue, Feb 24, 2015 at 8:35 AM, Joe Witt <[email protected]> wrote:
> Mike,
>
> This is extremely common. Both sides of this are. You have some
> low-latency or batch producer and you want to deliver to some
> low-latency or batch receiver.
>
> This is what splitting is for (in the case of going large to small) or
> joining is for (in the case of 'batching'). MergeContent is designed
> for the batching/aggregation case. It allows you to merge using a
> couple of strategies, with binary concatenation being the most common.
>
> The very classic example is receiving a live stream of data which
> needs to be sent to HDFS. We'd set up MergeContent to aggregate data
> to a size that is close to or matches the desired HDFS block size.
>
> Now the interesting part you mention is: what if 'object' 45 of 100
> causes a problem with the downstream system? How would/could NiFi
> know about that object? Is it not feasible to evaluate the data for
> its fitness to merge prior to doing so?
>

NiFi would only know that the batch failed. Maybe we would know that it
failed with 'bad data' rather than 'connection timed out', but I don't
think we would know that it failed with 'bad data at #45'.

The use case would be sending data with some kind of constraint. Maybe
field 1 is numeric and field 2 is some kind of date-time, and NiFi can
validate the schema (if it knows about it). But there might also be
business rules: field 3 cannot be empty if field 4 is empty; field 5
must match an existing username. These are certainly possible to
validate in NiFi, but it gets much harder to do so.

> Anyway - let us know what you're thinking in terms of how NiFi would
> know which object was problematic, or that any were problematic for
> that matter.
>
> Thanks
> Joe
>
> On Tue, Feb 24, 2015 at 9:28 AM, Mike Drob <[email protected]> wrote:
> > NiFi experts,
> >
> > Let's say that I want to send data from NiFi to some destination that
> > works much better when the documents are batched. I do not think this
> > is an unreasonable ask.
> >
> > I imagine that I would want to first combine all of the records in
> > one processor, and then pass on to a dedicated processor for sending
> > the data? I'm not sure yet if I would be able to use existing
> > processors for this, or if I could create my own, but this part
> > feels fairly straightforward.
> >
> > Next, let's imagine that some document in the batch causes it to
> > fail. I would like to un-batch, create smaller batches, and try to
> > send those, assuming that some piece of the data was malformed and
> > not a transient error like the network being unavailable. Is this
> > pattern workable? I can imagine several layers of fail/split/retry
> > to winnow from 1000 documents to 100 to 10 to 1, so that I can still
> > get most of my data sent and know exactly which documents fail.
> >
> > I'm largely thinking out loud here - somebody stop me if I'm off the
> > deep end, or if this has been done before and we have examples (I
> > didn't see any readily apparent).
> >
> > Mike
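The fail/split/retry winnowing Mike describes is essentially a recursive bisect over the batch. A minimal, NiFi-independent sketch in Python, where `send_batch` is a hypothetical stand-in for whatever downstream "Put" step would do the actual delivery (assumed here to raise `ValueError` on malformed data):

```python
# Recursive fail/split/retry: try the whole batch first; on a
# (non-transient) failure, split it in half and retry each half,
# winnowing down until the individual bad documents are isolated.
# `send_batch` is a hypothetical callable standing in for the
# downstream delivery step; it raises ValueError on bad data.

def winnow(docs, send_batch):
    """Send docs in as few batches as possible; return the list of
    individual documents that could not be sent."""
    if not docs:
        return []
    try:
        send_batch(docs)
        return []                     # whole batch accepted
    except ValueError:
        if len(docs) == 1:
            return docs               # isolated a bad document
        mid = len(docs) // 2
        return (winnow(docs[:mid], send_batch) +
                winnow(docs[mid:], send_batch))
```

For a batch of 1000 with one bad document, this sends O(log n) batches rather than 1000 single sends - close in spirit to the 1000 -> 100 -> 10 -> 1 layering described above, just with halving instead of tenths.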

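Evaluating data for its fitness to merge ahead of time, as Joe suggests, could amount to a small per-record rule set. A hedged sketch mirroring the hypothetical constraints from the thread (the field names `f1`-`f5` and the `known_users` lookup are illustrative assumptions, not anything NiFi provides):

```python
# Per-record validation before merging, mirroring the constraints
# discussed above: field 1 numeric, field 2 a date-time, field 3
# required when field 4 is empty, field 5 a known username.
# Records are plain dicts; `known_users` is an assumed lookup set.

from datetime import datetime

def validate(record, known_users):
    """Return a list of rule violations (empty list == fit to merge)."""
    errors = []
    try:
        float(record.get("f1", ""))
    except ValueError:
        errors.append("f1 must be numeric")
    try:
        datetime.fromisoformat(record.get("f2", ""))
    except ValueError:
        errors.append("f2 must be an ISO date-time")
    if not record.get("f4") and not record.get("f3"):
        errors.append("f3 cannot be empty if f4 is empty")
    if record.get("f5") not in known_users:
        errors.append("f5 must match an existing username")
    return errors
```

The schema-shaped rules (f1, f2) are the easy part; as noted above, the cross-field and lookup rules (f3/f4, f5) are where this gets much harder to express inside a flow.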