On Tue, Feb 24, 2015 at 8:35 AM, Joe Witt <[email protected]> wrote:
> Mike,
>
> This is extremely common. Both sides of this are. You have some
> low-latency or batch producer and you want to deliver to some
> low-latency or batch receiver.
>
> This is what splitting is for (in the case of going large to small) or
> joining is for (in the case of 'batching'). MergeContent is designed
> for the batching/aggregation case. It allows you to merge using a
> couple of strategies, with binary concatenation being the most common.
>
> The very classic example is receiving a live stream of data which
> needs to be sent to HDFS. We'd set up MergeContent to aggregate data
> to a size that is close to or matches the desired HDFS block size.
>
> Now the interesting part you mention is: what if 'object' 45 of 100
> causes a problem with the downstream system? How would/could NiFi
> know about that object? Is it not feasible to evaluate the data for
> its fitness to merge prior to doing so?
>

NiFi would only know that the batch failed. Maybe we would know that it
failed with 'bad data' rather than 'connection timed out', but I don't
think we would know that it failed with 'bad data at #45'.

The use case would be sending data with some kind of constraint. Maybe
field 1 is numeric and field 2 is some kind of date-time, and NiFi can
validate the schema (if it knows about it). But there might also be
business rules: field 3 cannot be empty if field 4 is empty; field 5
must match an existing username. These are certainly possible to
validate in NiFi, but it gets much harder to do so.

> Anyway - let us know what you're thinking in terms of how NiFi would
> know which object was problematic, or that any were problematic for
> that matter.
>
> Thanks
> Joe
>
> On Tue, Feb 24, 2015 at 9:28 AM, Mike Drob <[email protected]> wrote:
> > NiFi experts,
> >
> > Let's say that I want to send data from NiFi to some destination that
> > works much better when the documents are batched. I do not think this
> > is an unreasonable ask.
> >
> > I imagine that I would want to first combine all of the records in
> > one processor, and then pass on to a dedicated processor for sending
> > the data? I'm not sure yet if I would be able to use existing
> > processors for this, or if I could create my own, but this part
> > feels fairly straightforward.
> >
> > Next, let's imagine that some document in the batch causes it to
> > fail. I would like to un-batch, create smaller batches, and try to
> > send those, assuming that some piece of the data was malformed and
> > not a transient error like the network being unavailable. Is this
> > pattern workable? I can imagine several layers of fail/split/retry
> > to winnow from 1000 documents to 100 to 10 to 1, so that I can still
> > get most of my data sent and know exactly which documents fail.
> >
> > I'm largely thinking out loud here - somebody stop me if I'm off the
> > deep end, or if this has been done before and we have examples (I
> > didn't see any readily apparent).
> >
> > Mike
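The fail/split/retry winnowing Mike describes is essentially a recursive bisect over the batch. A minimal, NiFi-independent sketch in Python, where `send_batch` is a hypothetical stand-in for whatever downstream "Put" step would do the actual delivery (assumed here to raise `ValueError` on malformed data):

```python
# Recursive fail/split/retry: try the whole batch first; on a
# (non-transient) failure, split it in half and retry each half,
# winnowing down until the individual bad documents are isolated.
# `send_batch` is a hypothetical callable standing in for the
# downstream delivery step; it raises ValueError on bad data.

def winnow(docs, send_batch):
    """Send docs in as few batches as possible; return the list of
    individual documents that could not be sent."""
    if not docs:
        return []
    try:
        send_batch(docs)
        return []                     # whole batch accepted
    except ValueError:
        if len(docs) == 1:
            return docs               # isolated a bad document
        mid = len(docs) // 2
        return (winnow(docs[:mid], send_batch) +
                winnow(docs[mid:], send_batch))
```

For a batch of 1000 with one bad document, this sends O(log n) batches rather than 1000 single sends - close in spirit to the 1000 -> 100 -> 10 -> 1 layering described above, just with halving instead of tenths.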

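Evaluating data for its fitness to merge ahead of time, as Joe suggests, could amount to a small per-record rule set. A hedged sketch mirroring the hypothetical constraints from the thread (the field names `f1`-`f5` and the `known_users` lookup are illustrative assumptions, not anything NiFi provides):

```python
# Per-record validation before merging, mirroring the constraints
# discussed above: field 1 numeric, field 2 a date-time, field 3
# required when field 4 is empty, field 5 a known username.
# Records are plain dicts; `known_users` is an assumed lookup set.

from datetime import datetime

def validate(record, known_users):
    """Return a list of rule violations (empty list == fit to merge)."""
    errors = []
    try:
        float(record.get("f1", ""))
    except ValueError:
        errors.append("f1 must be numeric")
    try:
        datetime.fromisoformat(record.get("f2", ""))
    except ValueError:
        errors.append("f2 must be an ISO date-time")
    if not record.get("f4") and not record.get("f3"):
        errors.append("f3 cannot be empty if f4 is empty")
    if record.get("f5") not in known_users:
        errors.append("f5 must match an existing username")
    return errors
```

The schema-shaped rules (f1, f2) are the easy part; as noted above, the cross-field and lookup rules (f3/f4, f5) are where this gets much harder to express inside a flow.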