This might be useful. In this scenario you load your content into Solr for staging and perform your ETL from Solr to Solr:
http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Basically Solr becomes a text processing warehouse.

Joel Bernstein
http://joelsolr.blogspot.com/
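[For a rough idea of the shape of such a job: a Solr-to-Solr copy can be written as a streaming expression. This is a minimal sketch with made-up collection and field names, not taken from the post; the fields need docValues for the /export handler to stream them.]

    update(processedCollection,
           batchSize=500,
           search(stagingCollection,
                  q="*:*",
                  fl="id,title,body",
                  sort="id asc",
                  qt="/export"))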
On Thu, Nov 3, 2016 at 5:05 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> How big a batch are we talking about?
>
> Because I believe you could accumulate the docs in the first URP in
> the processAdd and then do the batch lookup and the actual processing
> of them on processCommit.
>
> They are daisy-chained, so as long as you are holding on to the chain,
> the rest of the URPs don't happen.
>
> Obviously you are relying on the commit here to trigger the final call.
>
> Or you could do a two-collection sequence: index into the first
> collection, query for whatever you need to batch-lookup, and then do a
> collection-to-collection enhanced copy.
>
> Regards,
>    Alex.
> ----
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 4 November 2016 at 07:35, mike st. john <mstj...@gmail.com> wrote:
> > Maybe introduce a distributed queue such as Apache Ignite, Hazelcast,
> > or even Redis. Read from the queue in batches, do your lookup, then
> > index the same batch.
> >
> > Just a thought.
> >
> > Mike St. John.
> >
> > On Nov 3, 2016 3:58 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> >
> >> I thought we might be talking past each other...
> >>
> >> I think you're into "roll your own" here. Anything that
> >> accumulated docs for a while, did a batch lookup
> >> on the external system, then passed on the docs
> >> runs the risk of losing docs if the server is abnormally
> >> shut down.
> >>
> >> I guess ideally you'd like to augment the list coming in
> >> rather than the docs once they're removed from the
> >> incoming batch and passed on, but I admit I have no
> >> clue where to do that. Possibly in an update chain? If
> >> so, you'd need to be careful to only augment when
> >> they'd reached their final shard leader, or all at once
> >> before distribution to the shard leaders.
> >>
> >> Is the expense of the external lookup in doing the actual
> >> lookups, or in establishing the connection? Would
> >> having some kind of shared connection to the external
> >> source be worthwhile?
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
> >> <markus.jel...@openindex.io> wrote:
> >> > Hi - I believe I did not explain myself well enough.
> >> >
> >> > Getting the data into Solr is not a problem; various sources index
> >> > docs to Solr, all in fine batches, as everyone should indeed do. The
> >> > thing is that I need to do some preprocessing before it is indexed.
> >> > Normally, UpdateProcessors are the way to go. I've made quite a few
> >> > of them and they work fine.
> >> >
> >> > The problem is, I need to do a remote lookup for each document being
> >> > indexed. Right now, I make an external connection for each doc being
> >> > indexed in the current UpdateProcessor. This is still fast. But the
> >> > remote backend supports batched lookups, which are faster.
> >> >
> >> > This is why I'd love to be able to buffer documents in an
> >> > UpdateProcessor, and when there are enough, do a remote lookup for
> >> > all of them, do some processing, and let them be indexed.
> >> >
> >> > Thanks,
> >> > Markus
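[To make Alexandre's processAdd/processCommit suggestion concrete, here is a rough sketch of such a buffering UpdateRequestProcessor. The RemoteLookupService stub and the field names are invented for illustration, and, per Erick's warning above, anything buffered here is lost if the server is shut down abnormally before a commit arrives.]

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.CommitUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class BatchLookupProcessor extends UpdateRequestProcessor {

      private final List<SolrInputDocument> buffered = new ArrayList<>();

      public BatchLookupProcessor(UpdateRequestProcessor next) {
        super(next);
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) {
        // Hold the document instead of passing the command on; Solr may
        // reuse the command object, so buffer a deep copy of the document.
        buffered.add(cmd.solrDoc.deepCopy());
      }

      @Override
      public void processCommit(CommitUpdateCommand cmd) throws IOException {
        if (!buffered.isEmpty()) {
          // One remote round trip for the whole batch, not one per doc.
          List<String> ids = new ArrayList<>();
          for (SolrInputDocument doc : buffered) {
            ids.add(doc.getFieldValue("id").toString());
          }
          Map<String, String> lookups = RemoteLookupService.batchLookup(ids);

          // Release the augmented documents down the rest of the chain.
          for (SolrInputDocument doc : buffered) {
            doc.setField("enriched_s",
                lookups.get(doc.getFieldValue("id").toString()));
            AddUpdateCommand add = new AddUpdateCommand(cmd.getReq());
            add.solrDoc = doc;
            super.processAdd(add);
          }
          buffered.clear();
        }
        super.processCommit(cmd);
      }

      /** Stand-in for the external batched lookup; not a real API. */
      static class RemoteLookupService {
        static Map<String, String> batchLookup(List<String> ids) {
          Map<String, String> out = new HashMap<>();
          for (String id : ids) {
            out.put(id, "looked-up-" + id);
          }
          return out;
        }
      }
    }

[Wiring this in would still need an UpdateRequestProcessorFactory registered in the solrconfig.xml update chain, and it only helps clients whose batches end with a commit; it illustrates the shape of the idea rather than a drop-in implementation.]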
> >> >
> >> > -----Original message-----
> >> >> From: Erick Erickson <erickerick...@gmail.com>
> >> >> Sent: Thursday 3rd November 2016 19:18
> >> >> To: solr-user <solr-user@lucene.apache.org>
> >> >> Subject: Re: UpdateProcessor as a batch
> >> >>
> >> >> I _thought_ you'd been around long enough to know about the options
> >> >> I mentioned ;).
> >> >>
> >> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really
> >> >> no batching at that level that I know of. I'm pretty sure that even
> >> >> indexing batches of 1,000 documents from, say, SolrJ goes through
> >> >> this method.
> >> >>
> >> >> I don't think there's much to be gained by any batching at this
> >> >> level; it pretty much immediately tells Lucene to index the doc.
> >> >>
> >> >> FWIW
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> >> >> <markus.jel...@openindex.io> wrote:
> >> >> > Erick - in this case data can come from anywhere. There is one
> >> >> > piece of code that all incoming documents, regardless of their
> >> >> > origin, are passed through: the update handler and update
> >> >> > processors of Solr.
> >> >> >
> >> >> > In my case that is the most convenient point to partially modify
> >> >> > the documents, instead of moving that logic to separate places.
> >> >> >
> >> >> > I've seen the ContentStream in SolrQueryResponse and I probably
> >> >> > could tear incoming data apart and put it back together again,
> >> >> > but that would not be as easy as working with already
> >> >> > deserialized objects such as SolrInputDocument.
> >> >> >
> >> >> > UpdateHandler doesn't seem to work on a list of documents; it
> >> >> > looks like it works on incoming stuff, not a whole list. I've
> >> >> > also looked at whether I could buffer a batch in an
> >> >> > UpdateProcessor, work on the docs, and release them, but that
> >> >> > seems impossible.
> >> >> >
> >> >> > Thanks,
> >> >> > Markus
> >> >> >
> >> >> > -----Original message-----
> >> >> >> From: Erick Erickson <erickerick...@gmail.com>
> >> >> >> Sent: Thursday 3rd November 2016 18:57
> >> >> >> To: solr-user <solr-user@lucene.apache.org>
> >> >> >> Subject: Re: UpdateProcessor as a batch
> >> >> >>
> >> >> >> Markus:
> >> >> >>
> >> >> >> How are you indexing? SolrJ has a
> >> >> >> client.add(List<SolrInputDocument>) form, and post.jar lets you
> >> >> >> add as many documents as you want in a batch...
> >> >> >>
> >> >> >> Best,
> >> >> >> Erick
> >> >> >>
> >> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
> >> >> >> <markus.jel...@openindex.io> wrote:
> >> >> >> > Hi - I need to process a batch of documents on update, but I
> >> >> >> > cannot seem to find a point where I can hook in and process a
> >> >> >> > list of SolrInputDocuments, neither in UpdateProcessor nor in
> >> >> >> > UpdateHandler.
> >> >> >> >
> >> >> >> > For now I let it go and implemented it on a per-document
> >> >> >> > basis. It is fast, but I'd prefer batches. Is that possible
> >> >> >> > at all?
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Markus
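[As a footnote to Erick's question about how the indexing is done: the batched SolrJ form he mentions looks roughly like this. The URL, collection, and field names are placeholders.]

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchAdd {
      public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/collection1").build()) {
          List<SolrInputDocument> batch = new ArrayList<>();
          for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_s", "Document " + i);
            batch.add(doc);
          }
          // One request for the whole batch, not one per document.
          client.add(batch);
          client.commit();
        }
      }
    }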