Maybe introduce a distributed queue such as Apache Ignite, Hazelcast, or
even Redis. Read from the queue in batches, do your lookup, then index the
same batch.
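
For example, with Redis (a rough sketch; the queue key "pending-docs",
the batch size, and the id-only payload are all made up for
illustration):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import redis.clients.jedis.Jedis;

public class QueueBatchIndexer {
  public static void main(String[] args) throws Exception {
    try (Jedis redis = new Jedis("localhost", 6379);
         HttpSolrClient solr = new HttpSolrClient.Builder(
             "http://localhost:8983/solr/mycollection").build()) {
      while (true) {
        // Drain up to 500 queued ids into a batch.
        List<String> ids = new ArrayList<>();
        String id;
        while (ids.size() < 500 && (id = redis.lpop("pending-docs")) != null) {
          ids.add(id);
        }
        if (ids.isEmpty()) { Thread.sleep(1000); continue; }

        // Do one batched lookup against the external backend here,
        // then build and index the same batch in a single request.
        List<SolrInputDocument> batch = new ArrayList<>();
        for (String docId : ids) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", docId);
          // ...add the fields returned by the batched lookup...
          batch.add(doc);
        }
        solr.add(batch);
      }
    }
  }
}

One caveat: LPOP removes items before they are indexed, so a crash loses
the in-flight batch. RPOPLPUSH onto a "processing" list is the usual
Redis pattern if you need at-least-once delivery.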

Just a thought.

Mike St. John.

On Nov 3, 2016 3:58 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

> I thought we might be talking past each other...
>
> I think you're into "roll your own" here. Anything that
> accumulates docs for a while, does a batch lookup
> on the external system, then passes the docs on
> runs the risk of losing docs if the server is shut
> down abnormally.
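>
> For illustration, the kind of thing I mean -- an untested sketch,
> where lookupBatch() is a placeholder for the remote call, and the
> docs sitting in the buffer are exactly the ones at risk:
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.solr.common.SolrInputDocument;
> import org.apache.solr.request.SolrQueryRequest;
> import org.apache.solr.update.AddUpdateCommand;
> import org.apache.solr.update.processor.UpdateRequestProcessor;
>
> public class BufferingLookupProcessor extends UpdateRequestProcessor {
>   private static final int BATCH_SIZE = 100;
>   private final List<SolrInputDocument> buffer = new ArrayList<>();
>   private final SolrQueryRequest req;
>
>   public BufferingLookupProcessor(SolrQueryRequest req,
>       UpdateRequestProcessor next) {
>     super(next);
>     this.req = req;
>   }
>
>   @Override
>   public void processAdd(AddUpdateCommand cmd) throws IOException {
>     // Hold the document instead of passing it on. Keep the
>     // SolrInputDocument, not the command: some loaders reuse
>     // the command object between documents.
>     buffer.add(cmd.solrDoc);
>     if (buffer.size() >= BATCH_SIZE) flush();
>   }
>
>   @Override
>   public void finish() throws IOException {
>     flush(); // drain what's left at the end of the request
>     super.finish();
>   }
>
>   private void flush() throws IOException {
>     if (buffer.isEmpty()) return;
>     // lookupBatch(buffer); // hypothetical: one remote call for all docs
>     for (SolrInputDocument doc : buffer) {
>       AddUpdateCommand add = new AddUpdateCommand(req);
>       add.solrDoc = doc;
>       super.processAdd(add); // now let it continue down the chain
>     }
>     buffer.clear();
>   }
> }
>
> Note it rebuilds the AddUpdateCommands, dropping per-document flags
> like overwrite, and finish() only ever sees one request's documents.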
>
> I guess ideally you'd like to augment the incoming list
> rather than the individual docs once they're taken off the
> incoming batch and passed on, but I admit I have no
> clue where to do that. Possibly in an update chain? If
> so, you'd need to be careful to augment only after the
> docs have reached their final shard leader, or to do it
> all at once before distribution to the shard leaders.
>
> Is the expense of the external lookup in doing the actual
> lookups or in establishing the connection? Would
> having some kind of shared connection to the external
> source be worthwhile?
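>
> If it's the connection, a client pool shared at the factory level
> might be enough -- a sketch assuming Apache HttpClient, with
> LookupProcessor standing in for your own processor class:
>
> import org.apache.http.impl.client.CloseableHttpClient;
> import org.apache.http.impl.client.HttpClients;
> import org.apache.solr.request.SolrQueryRequest;
> import org.apache.solr.response.SolrQueryResponse;
> import org.apache.solr.update.processor.UpdateRequestProcessor;
> import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
>
> public class LookupProcessorFactory extends UpdateRequestProcessorFactory {
>   // One pooled client shared by every processor instance this
>   // factory creates: connections are established once and reused.
>   private final CloseableHttpClient http = HttpClients.custom()
>       .setMaxConnPerRoute(20)
>       .setMaxConnTotal(20)
>       .build();
>
>   @Override
>   public UpdateRequestProcessor getInstance(SolrQueryRequest req,
>       SolrQueryResponse rsp, UpdateRequestProcessor next) {
>     return new LookupProcessor(http, next); // your processor
>   }
> }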
>
> FWIW,
> Erick
>
> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> > Hi - I believe I did not explain myself well enough.
> >
> > Getting the data into Solr is not a problem; various sources index
> > docs to Solr, all in proper batches, as everyone indeed should. The
> > thing is that I need to do some preprocessing before a document is
> > indexed. Normally, UpdateProcessors are the way to go; I've made
> > quite a few of them and they work fine.
> >
> > The problem is, I need to do a remote lookup for each document being
> > indexed. Right now, I make an external connection for each doc being
> > indexed in the current UpdateProcessor. This is still fast, but the
> > remote backend supports batched lookups, which are faster.
> >
> > This is why I'd love to be able to buffer documents in an
> > UpdateProcessor: once there are enough, do one remote lookup for all
> > of them, do some processing, and let them be indexed.
> >
> > Thanks,
> > Markus
> >
> >
> >
> > -----Original message-----
> >> From: Erick Erickson <erickerick...@gmail.com>
> >> Sent: Thursday 3rd November 2016 19:18
> >> To: solr-user <solr-user@lucene.apache.org>
> >> Subject: Re: UpdateProcessor as a batch
> >>
> >> I _thought_ you'd been around long enough to know about the options I
> >> mentioned ;).
> >>
> >> Right. I'd guess you're in UpdateHandler.addDoc, and there's really
> >> no batching at that level that I know of. I'm pretty sure that even
> >> batches of 1,000 documents from, say, SolrJ go through this method
> >> one document at a time.
> >>
> >> I don't think there's much to be gained by batching at this level;
> >> it pretty much immediately tells Lucene to index the doc.
> >>
> >> FWIW
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> >> <markus.jel...@openindex.io> wrote:
> >> > Erick - in this case the data can come from anywhere. There is one
> >> > piece of code that all incoming documents, regardless of their
> >> > origin, pass through: Solr's update handler and update processors.
> >> >
> >> > In my case that is the most convenient point to partially modify
> >> > the documents, instead of moving that logic to separate places.
> >> >
> >> > I've seen the ContentStream in SolrQueryRequest, and I probably
> >> > could tear the incoming data apart and put it back together again,
> >> > but that would not be as easy as working with already deserialized
> >> > objects such as SolrInputDocument.
> >> >
> >> > UpdateHandler doesn't seem to work on a list of documents; it
> >> > looks like it works on individual incoming documents, not a whole
> >> > list. I've also looked at whether I could buffer a batch in an
> >> > UpdateProcessor, work on the documents, and release them, but that
> >> > seems impossible.
> >> >
> >> > Thanks,
> >> > Markus
> >> >
> >> > -----Original message-----
> >> >> From: Erick Erickson <erickerick...@gmail.com>
> >> >> Sent: Thursday 3rd November 2016 18:57
> >> >> To: solr-user <solr-user@lucene.apache.org>
> >> >> Subject: Re: UpdateProcessor as a batch
> >> >>
> >> >> Markus:
> >> >>
> >> >> How are you indexing? SolrJ has a
> >> >> client.add(List<SolrInputDocument>) form, and post.jar lets you
> >> >> add as many documents as you want in a batch...
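> >> >>
> >> >> E.g., a minimal SolrJ sketch (the URL, collection name, and
> >> >> field are just examples):
> >> >>
> >> >> import java.util.ArrayList;
> >> >> import java.util.List;
> >> >> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> >> >> import org.apache.solr.common.SolrInputDocument;
> >> >>
> >> >> public class BatchAdd {
> >> >>   public static void main(String[] args) throws Exception {
> >> >>     try (HttpSolrClient client = new HttpSolrClient.Builder(
> >> >>         "http://localhost:8983/solr/mycollection").build()) {
> >> >>       List<SolrInputDocument> batch = new ArrayList<>();
> >> >>       for (int i = 0; i < 1000; i++) {
> >> >>         SolrInputDocument doc = new SolrInputDocument();
> >> >>         doc.addField("id", Integer.toString(i));
> >> >>         batch.add(doc);
> >> >>       }
> >> >>       client.add(batch);  // one request for the whole list
> >> >>       client.commit();
> >> >>     }
> >> >>   }
> >> >> }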
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
> >> >> <markus.jel...@openindex.io> wrote:
> >> >> > Hi - I need to process a batch of documents on update, but I
> >> >> > cannot seem to find a point where I can hook in and process a
> >> >> > list of SolrInputDocuments, neither in UpdateProcessor nor in
> >> >> > UpdateHandler.
> >> >> >
> >> >> > For now I let it go and implemented it on a per-document basis;
> >> >> > it is fast, but I'd prefer batches. Is that possible at all?
> >> >> >
> >> >> > Thanks,
> >> >> > Markus
> >> >>
> >>
>
