Mike,

You might also want to look at BinFiles. Joe Gresock submitted a pull request
recently that separated the binning logic from MergeContent into a base class
(BinFiles).
So if you wanted more control over processing the batch, you could write a
processor that extends BinFiles. This might help in deciding what to do with
failures, because you would have access to the individual FlowFiles that make
up the batch.

-Bryan

On Tue, Feb 24, 2015 at 10:04 AM, Joe Witt <[email protected]> wrote:
> Mike,
>
> Sounds like there are two approaches here (or perhaps a combination).
>
> First is to have the flow configured to detect bad data / validate good
> data before merging.
>
> Second is to wait for failures to occur, then establish a configuration
> which allows you to predict/establish which object(s) was problematic.
>
> Both patterns have been employed, both in the design of some processors
> and in the design of flows. In my experience, when you can get away
> with validation up front it is the more 'cost effective' approach.
> But your case may differ.
>
> If you want to dive into more specifics of your case, happy to do so.
>
> Thanks
> Joe
>
>
> On Tue, Feb 24, 2015 at 9:53 AM, Mike Drob <[email protected]> wrote:
> > On Tue, Feb 24, 2015 at 8:35 AM, Joe Witt <[email protected]> wrote:
> >
> >> Mike,
> >>
> >> This is extremely common. Both sides of this are. You have some
> >> low-latency or batch producer and you want to deliver to some
> >> low-latency or batch receiver.
> >>
> >> This is what splitting is for (in the case of going large to small) or
> >> joining is for (in the case of 'batching'). MergeContent is designed
> >> for the batching/aggregation case. It allows you to merge using a
> >> couple of strategies, with binary concatenation being the most common.
> >>
> >> The very classic example is receiving a live stream of data which
> >> needs to be sent to HDFS. We'd set up MergeContent to aggregate data
> >> to a size that is close to or matches the desired HDFS block size.
> >>
> >> Now the interesting part you mention is: what if 'object' 45 of 100
> >> causes a problem with the downstream system? How would/could NiFi
> >> know about that object? Is it not feasible to evaluate the data for
> >> its fitness to merge prior to doing so?
> >>
> > NiFi would only know that the batch failed. Maybe we would know that it
> > failed with 'bad data' rather than 'connection timed out', but I don't
> > think we would know that it failed with 'bad data at #45'.
> >
> > The use case would be sending data with some kind of constraint. Maybe
> > it is that field 1 is numeric and field 2 is some kind of date-time,
> > and NiFi can validate the schema (if it knows about it). But there
> > might also be business rules: field 3 cannot be empty if field 4 is
> > empty; field 5 must match an existing username. These are certainly
> > possible to validate in NiFi, but get much harder to do so.
> >
> >
> >> Anyway - let us know what you're thinking in terms of how NiFi would
> >> know which object was problematic, or that any were problematic for
> >> that matter.
> >>
> >> Thanks
> >> Joe
> >>
> >> On Tue, Feb 24, 2015 at 9:28 AM, Mike Drob <[email protected]> wrote:
> >> > NiFi experts,
> >> >
> >> > Let's say that I want to send data from NiFi to some destination
> >> > that works much better when the documents are batched. I do not
> >> > think this is an unreasonable ask.
> >> >
> >> > I imagine that I would want to first combine all of the records in
> >> > one processor, and then pass them on to a dedicated processor for
> >> > sending the data. I'm not sure yet if I would be able to use
> >> > existing processors for this, or if I could create my own, but this
> >> > part feels fairly straightforward.
> >> >
> >> > Next, let's imagine that some document in the batch causes it to
> >> > fail. I would like to un-batch, create smaller batches, and try to
> >> > send those, assuming that some piece of the data was malformed and
> >> > not a transient error like network unavailability. Is this pattern
> >> > workable? I can imagine several layers of fail/split/retry to
> >> > winnow from 1000 documents to 100 to 10 to 1, so that I can still
> >> > get most of my data sent and know exactly which documents fail.
> >> >
> >> > I'm largely thinking out loud here; somebody stop me if I'm off the
> >> > deep end, or if this has been done before and we have examples (I
> >> > didn't see any readily apparent).
> >> >
> >> > Mike
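[Editor's note] The fail/split/retry winnowing Mike describes (1000 documents
down to 100, to 10, to 1) is essentially a recursive bisect over the batch.
Here is a minimal sketch outside NiFi; `send_batch` is a hypothetical
stand-in for the downstream system, not a NiFi API:

```python
def winnow(docs, send_batch, min_size=1):
    """Try to send a batch; on failure, split it in half and retry each half.

    Returns (sent, failed): documents that were delivered, and the
    individual documents isolated as the cause of failures.
    """
    try:
        send_batch(docs)
        return list(docs), []
    except ValueError:
        if len(docs) <= min_size:
            return [], list(docs)  # isolated a bad document
        mid = len(docs) // 2
        s1, f1 = winnow(docs[:mid], send_batch)
        s2, f2 = winnow(docs[mid:], send_batch)
        return s1 + s2, f1 + f2

# Hypothetical downstream system that rejects any batch containing 'bad',
# without reporting which document was the problem.
def send_batch(docs):
    if any(d == "bad" for d in docs):
        raise ValueError("bad data somewhere in the batch")

sent, failed = winnow(["a", "b", "bad", "c", "d"], send_batch)
# sent == ["a", "b", "c", "d"], failed == ["bad"]
```

As the thread notes, this costs repeated sends of good data; it pays off only
when failures are rare and the destination gives no per-document feedback.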
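[Editor's note] Joe's "validate up front" alternative, applied to the example
rules from Mike's message, could look like the sketch below. The field names
and rules come from the thread; `known_users` and the date format are assumed
for illustration:

```python
from datetime import datetime

known_users = {"mdrob", "joewitt"}  # assumed username lookup


def validate(record):
    """Return a list of rule violations; an empty list means valid."""
    errors = []
    # field 1 must be numeric
    if not str(record.get("field1", "")).lstrip("-").isdigit():
        errors.append("field1 not numeric")
    # field 2 must be a date-time (assumed ISO-like format)
    try:
        datetime.strptime(record.get("field2", ""), "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        errors.append("field2 not a date-time")
    # field 3 cannot be empty if field 4 is empty
    if not record.get("field4") and not record.get("field3"):
        errors.append("field3 empty while field4 is empty")
    # field 5 must match an existing username
    if record.get("field5") not in known_users:
        errors.append("field5 not a known username")
    return errors


record = {"field1": "42", "field2": "2015-02-24T09:53:00",
          "field3": "x", "field4": "", "field5": "mdrob"}
# valid records continue to the merge step; invalid ones are routed aside
assert validate(record) == []
```

In a flow, records failing validation would be routed to a failure
relationship before MergeContent ever sees them, so a merged batch only
contains data already known to satisfy the constraints.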
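[Editor's note] The MergeContent-to-HDFS pattern Joe describes (aggregate
until the bin approaches the desired block size, then flush) reduces to a
simple greedy size-based binner. This is a sketch of the idea only, not
NiFi's BinFiles implementation:

```python
def bin_by_size(records, max_bin_bytes):
    """Greedily pack records into bins that stay under max_bin_bytes,
    mimicking 'merge until near the HDFS block size' aggregation."""
    bins, current, current_size = [], [], 0
    for rec in records:
        size = len(rec)
        if current and current_size + size > max_bin_bytes:
            bins.append(current)  # flush the full bin downstream
            current, current_size = [], 0
        current.append(rec)
        current_size += size
    if current:
        bins.append(current)  # flush the final partial bin
    return bins


records = [b"x" * 40, b"y" * 40, b"z" * 40]
bins = bin_by_size(records, max_bin_bytes=100)
# two bins: the first two records (80 bytes), then the third (40 bytes)
```

The real processors also bound bins by entry count and age, so a trickle of
data still gets flushed; only the size criterion is shown here.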
