Maybe introduce a distributed queue such as Apache Ignite, Hazelcast, or even Redis. Read from the queue in batches, do your lookup, then index the same batch.
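A rough, untested sketch of that consumer loop, assuming Redis via Jedis and SolrJ; the queue name, batch size, and enrich step are all illustrative. Note Erick's caveat below: items popped but not yet indexed are lost if the consumer dies, so something like RPOPLPUSH with an in-flight list would be safer.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import redis.clients.jedis.Jedis;

public class QueueBatchIndexer {

    static final int BATCH_SIZE = 500; // illustrative

    public static void main(String[] args) throws Exception {
        try (Jedis queue = new Jedis("localhost", 6379);
             SolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/mycollection").build()) {
            while (true) {
                // Drain up to BATCH_SIZE items from the queue.
                List<String> items = new ArrayList<>();
                for (int i = 0; i < BATCH_SIZE; i++) {
                    String item = queue.rpop("indexing-queue");
                    if (item == null) break;
                    items.add(item);
                }
                if (items.isEmpty()) {
                    Thread.sleep(1000); // queue empty, back off briefly
                    continue;
                }
                // One batched call to the external backend for the whole
                // batch, then index that same batch.
                List<SolrInputDocument> docs = enrichBatch(items);
                solr.add(docs); // rely on autoCommit/commitWithin in solrconfig.xml
            }
        }
    }

    // Hypothetical stand-in for deserializing queued items and doing the
    // batched remote lookup; here each item is just treated as a doc id.
    static List<SolrInputDocument> enrichBatch(List<String> items) {
        List<SolrInputDocument> docs = new ArrayList<>();
        for (String id : items) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("id", id);
            doc.setField("enriched_b", true);
            docs.add(doc);
        }
        return docs;
    }
}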
Just a thought.

Mike St. John.

On Nov 3, 2016 3:58 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

> I thought we might be talking past each other...
>
> I think you're into "roll your own" here. Anything that accumulated
> docs for a while, did a batch lookup on the external system, then
> passed on the docs runs the risk of losing docs if the server is
> abnormally shut down.
>
> I guess ideally you'd like to augment the list coming in rather than
> the docs once they're removed from the incoming batch and passed on,
> but I admit I have no clue where to do that. Possibly in an update
> chain? If so, you'd need to be careful to only augment once they'd
> reached their final shard leader, or all at once before distribution
> to the shard leaders.
>
> Is the expense of the external lookup doing the actual lookups or
> establishing the connection? Would having some kind of shared
> connection to the external source be worthwhile?
>
> FWIW,
> Erick
>
> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> > Hi - I believe I did not explain myself well enough.
> >
> > Getting the data into Solr is not a problem: various sources index
> > docs to Solr, all in fine batches, as everyone indeed should. The
> > thing is that I need to do some preprocessing before a document is
> > indexed. Normally, UpdateProcessors are the way to go. I've made
> > quite a few of them and they work fine.
> >
> > The problem is, I need to do a remote lookup for each document
> > being indexed. Right now, I make an external connection for each
> > doc being indexed in the current UpdateProcessor. This is still
> > fast. But the remote backend supports batched lookups, which are
> > faster.
> >
> > This is why I'd love to be able to buffer documents in an
> > UpdateProcessor and, once there are enough, do a remote lookup for
> > all of them, do some processing, and let them be indexed.
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> >> From: Erick Erickson <erickerick...@gmail.com>
> >> Sent: Thursday 3rd November 2016 19:18
> >> To: solr-user <solr-user@lucene.apache.org>
> >> Subject: Re: UpdateProcessor as a batch
> >>
> >> I _thought_ you'd been around long enough to know about the
> >> options I mentioned ;).
> >>
> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really
> >> no batching at that level that I know of. I'm pretty sure that
> >> even indexing batches of 1,000 documents from, say, SolrJ goes
> >> through this method.
> >>
> >> I don't think there's much to be gained by any batching at this
> >> level; it pretty much immediately tells Lucene to index the doc.
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> >> <markus.jel...@openindex.io> wrote:
> >> > Erick - in this case data can come from anywhere. There is one
> >> > piece of code that all incoming documents, regardless of their
> >> > origin, pass through: the update handler and update processors
> >> > of Solr.
> >> >
> >> > In my case that is the most convenient point to partially modify
> >> > the documents, instead of moving that logic to separate places.
> >> >
> >> > I've seen the ContentStream in SolrQueryResponse and I probably
> >> > could tear the incoming data apart and put it back together
> >> > again, but that would not be as easy as working with already
> >> > deserialized objects such as SolrInputDocument.
> >> >
> >> > UpdateHandler doesn't seem to work on a list of documents; it
> >> > looks like it works on documents as they come in, not on a whole
> >> > list.
> >> > I've also looked at whether I could buffer a batch in an
> >> > UpdateProcessor, work on the documents, and release them, but
> >> > that seems impossible.
> >> >
> >> > Thanks,
> >> > Markus
> >> >
> >> > -----Original message-----
> >> >> From: Erick Erickson <erickerick...@gmail.com>
> >> >> Sent: Thursday 3rd November 2016 18:57
> >> >> To: solr-user <solr-user@lucene.apache.org>
> >> >> Subject: Re: UpdateProcessor as a batch
> >> >>
> >> >> Markus:
> >> >>
> >> >> How are you indexing? SolrJ has a
> >> >> client.add(List<SolrInputDocument>) form, and post.jar lets you
> >> >> add as many documents as you want in a batch....
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
> >> >> <markus.jel...@openindex.io> wrote:
> >> >> > Hi - I need to process a batch of documents on update, but I
> >> >> > cannot seem to find a point where I can hook in and process a
> >> >> > list of SolrInputDocuments, neither in UpdateProcessor nor in
> >> >> > UpdateHandler.
> >> >> >
> >> >> > For now I let it go and implemented it on a per-document
> >> >> > basis. It is fast, but I'd prefer batches. Is that possible
> >> >> > at all?
> >> >> >
> >> >> > Thanks,
> >> >> > Markus
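To make the buffering idea discussed above concrete: below is a rough, untested sketch (not code from this thread) of an UpdateRequestProcessor for standalone Solr that buffers adds within a single update request and flushes them in finish(), doing one batched remote lookup first. It carries the caveats Erick raised: anything buffered is lost if the server is shut down abnormally mid-request, and in SolrCloud the placement relative to the distributed processor in the chain matters. Also, some loaders reuse the AddUpdateCommand instance, so the document is deep-copied before buffering. The remoteLookup() method and field names are purely illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class BatchLookupProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new BatchLookupProcessor(next);
  }

  static class BatchLookupProcessor extends UpdateRequestProcessor {
    private final List<AddUpdateCommand> buffer = new ArrayList<>();

    BatchLookupProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      // Some loaders reuse the AddUpdateCommand instance for every doc,
      // so copy what we need instead of holding a reference to cmd itself.
      AddUpdateCommand copy = new AddUpdateCommand(cmd.getReq());
      copy.solrDoc = cmd.getSolrInputDocument().deepCopy();
      copy.commitWithin = cmd.commitWithin;
      copy.overwrite = cmd.overwrite;
      buffer.add(copy);
      // Docs buffered here are lost if the server dies before finish().
    }

    @Override
    public void finish() throws IOException {
      List<SolrInputDocument> docs = new ArrayList<>();
      for (AddUpdateCommand cmd : buffer) {
        docs.add(cmd.getSolrInputDocument());
      }
      remoteLookup(docs); // one batched call instead of one per document
      for (AddUpdateCommand cmd : buffer) {
        super.processAdd(cmd); // now pass each doc down the chain
      }
      buffer.clear();
      super.finish();
    }

    // Hypothetical stand-in for the external backend's batched lookup;
    // augments the documents in place.
    private void remoteLookup(List<SolrInputDocument> docs) {
      for (SolrInputDocument doc : docs) {
        doc.setField("enriched_b", true);
      }
    }
  }
}

In solrconfig.xml this would sit in an updateRequestProcessorChain ahead of RunUpdateProcessorFactory, so the buffered docs are only handed to the index writer after the batched lookup has run.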