How big a batch are we talking about?

Because I believe you could accumulate the docs in the first URP's
processAdd and then do the batch lookup and the actual processing of
them in processCommit.

The URPs are daisy-chained, so as long as you hold on to the docs
instead of passing them down the chain, the rest of the URPs don't run.

Obviously you are relying on the commit here to trigger the final call.
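
Very roughly, something along these lines - an untested sketch, where the
"id"/"external_data" field names, the 500-doc buffer size and
externalBatchLookup() are placeholders for whatever your backend actually
offers, and which assumes the adds and the commit arrive in the same update
request (processor instances are created per request):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.CommitUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class BatchLookupProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new BatchLookupProcessor(req, next);
  }

  static class BatchLookupProcessor extends UpdateRequestProcessor {
    private static final int MAX_BUFFER = 500;   // assumed batch size
    private final SolrQueryRequest req;
    private final List<SolrInputDocument> buffer = new ArrayList<>();

    BatchLookupProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      // hold the doc instead of passing it down the chain; copy it because
      // the loader may reuse the AddUpdateCommand instance
      buffer.add(cmd.getSolrInputDocument().deepCopy());
      if (buffer.size() >= MAX_BUFFER) {
        flush();
      }
    }

    @Override
    public void processCommit(CommitUpdateCommand cmd) throws IOException {
      flush();                     // the commit triggers the final batch
      super.processCommit(cmd);
    }

    private void flush() throws IOException {
      if (buffer.isEmpty()) {
        return;
      }
      // one round trip to the remote backend for the whole buffer;
      // externalBatchLookup() stands in for your backend's batched API
      Map<String, Object> lookups = externalBatchLookup(buffer);
      for (SolrInputDocument doc : buffer) {
        Object extra = lookups.get((String) doc.getFieldValue("id"));
        if (extra != null) {
          doc.setField("external_data", extra);
        }
        AddUpdateCommand cmd = new AddUpdateCommand(req);
        cmd.solrDoc = doc;
        super.processAdd(cmd);     // now release the doc to the rest of the chain
      }
      buffer.clear();
    }

    private Map<String, Object> externalBatchLookup(List<SolrInputDocument> docs) {
      return Collections.emptyMap();   // placeholder
    }
  }
}

You'd wire the factory into the updateRequestProcessorChain in
solrconfig.xml early in the chain (see Erick's note below about augmenting
before distribution to the shard leaders), and keep his caveat in mind:
anything still buffered is lost if the node dies before the commit.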

Or you could do a two-collection sequence: index into the first
collection, query it for whatever you need to batch-lookup, and then do
an enriched Collection-to-Collection copy.
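
If you went that route, the copy step could be a small SolrJ cursor loop
along these lines - again just a sketch, with the "staging"/"live"
collection names, the ZooKeeper host and enrichWithBatchLookup() all made up:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CollectionToCollectionCopy {

  public static void copy(String zkHost) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient(zkHost)) {
      String cursor = CursorMarkParams.CURSOR_MARK_START;
      while (true) {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(500);
        q.setSort(SolrQuery.SortClause.asc("id"));  // cursors need a stable sort
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = client.query("staging", q);

        List<SolrInputDocument> batch = new ArrayList<>();
        for (SolrDocument d : rsp.getResults()) {
          SolrInputDocument in = new SolrInputDocument();
          for (String field : d.getFieldNames()) {
            if (!"_version_".equals(field)) {       // don't copy the version field
              in.setField(field, d.getFieldValue(field));
            }
          }
          batch.add(in);
        }
        enrichWithBatchLookup(batch);               // one remote call per page
        if (!batch.isEmpty()) {
          client.add("live", batch);
        }

        String next = rsp.getNextCursorMark();
        if (next.equals(cursor)) {
          break;                                    // cursor stopped moving: done
        }
        cursor = next;
      }
      client.commit("live");
    }
  }

  // placeholder for the batched remote lookup
  static void enrichWithBatchLookup(List<SolrInputDocument> batch) {
  }
}

That way the enrichment happens once per page of 500 docs instead of once
per doc.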

Regards,
   Alex.
----
Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 4 November 2016 at 07:35, mike st. john <mstj...@gmail.com> wrote:
> Maybe introduce a distributed queue such as Apache Ignite, Hazelcast or
> even Redis. Read from the queue in batches, do your lookup, then index the
> same batch.
>
> just a thought.
>
> Mike St. John.
>
> On Nov 3, 2016 3:58 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>> I thought we might be talking past each other...
>>
>> I think you're into "roll your own" here. Anything that
>> accumulated docs for a while, did a batch lookup
>> on the external system, then passed on the docs
>> runs the risk of losing docs if the server is abnormally
>> shut down.
>>
>> I guess ideally you'd like to augment the list coming in
>> rather than the docs once they're removed from the
>> incoming batch and passed on, but I admit I have no
>> clue where to do that. Possibly in an update chain? If
>> so, you'd need to be careful to only augment when
>> they'd reached their final shard leader or all at once
>> before distribution to shard leaders.
>>
>> Is the expense for the external lookup doing the actual
>> lookups or establishing the connection? Would
>> having some kind of shared connection to the external
>> source be worthwhile?
>>
>> FWIW,
>> Erick
>>
>> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
>> <markus.jel...@openindex.io> wrote:
>> > Hi - I believe I did not explain myself well enough.
>> >
>> > Getting the data into Solr is not a problem; various sources index docs
>> into Solr, all in proper batches as everyone indeed should. The thing is
>> that I need to do some preprocessing before a doc is indexed. Normally,
>> UpdateProcessors are the way to go; I've made quite a few of them and they
>> work fine.
>> >
>> > The problem is, I need to do a remote lookup for each document being
>> indexed. Right now, I make an external connection for each doc being
>> indexed in the current UpdateProcessor. This is still fast. But the remote
>> backend supports batched lookups, which are faster.
>> >
>> > This is why I'd love to be able to buffer documents in an
>> UpdateProcessor, and if there are enough, do a remote lookup for all of
>> them, do some processing, and let them be indexed.
>> >
>> > Thanks,
>> > Markus
>> >
>> >
>> >
>> > -----Original message-----
>> >> From:Erick Erickson <erickerick...@gmail.com>
>> >> Sent: Thursday 3rd November 2016 19:18
>> >> To: solr-user <solr-user@lucene.apache.org>
>> >> Subject: Re: UpdateProcessor as a batch
>> >>
>> >> I _thought_ you'd been around long enough to know about the options I
>> >> mentioned ;).
>> >>
>> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really no
>> >> batching at that level that I know of. I'm pretty sure that even
>> >> indexing batches of 1,000 documents from, say, SolrJ go through this
>> >> method.
>> >>
>> >> I don't think there's much to be gained by any batching at this level;
>> >> it pretty immediately tells Lucene to index the doc.
>> >>
>> >> FWIW
>> >> Erick
>> >>
>> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
>> >> <markus.jel...@openindex.io> wrote:
>> >> > Erick - in this case data can come from anywhere. There is one piece
>> of code that all incoming documents, regardless of their origin, pass
>> through: the update handler and update processors of Solr.
>> >> >
>> >> > In my case that is the most convenient point to partially modify the
>> documents, instead of moving that logic to separate places.
>> >> >
>> >> > I've seen the ContentStream in SolrQueryResponse and I probably could
>> tear incoming data apart and put it back together again, but that would not
>> be as easy as working with already deserialized objects such as
>> SolrInputDocument.
>> >> >
>> >> > UpdateHandler doesn't seem to work on a list of documents; it looks
>> like it works on individual incoming documents, not a whole list. I've also
>> looked at whether I could buffer a batch in an UpdateProcessor, work on the
>> docs, and release them, but that seems impossible.
>> >> >
>> >> > Thanks,
>> >> > Markus
>> >> >
>> >> > -----Original message-----
>> >> >> From:Erick Erickson <erickerick...@gmail.com>
>> >> >> Sent: Thursday 3rd November 2016 18:57
>> >> >> To: solr-user <solr-user@lucene.apache.org>
>> >> >> Subject: Re: UpdateProcessor as a batch
>> >> >>
>> >> >> Markus:
>> >> >>
>> >> >> How are you indexing? SolrJ has a client.add(List<
>> SolrInputDocument>)
>> >> >> form, and post.jar lets you add as many documents as you want in a
>> >> >> batch....
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
>> >> >> <markus.jel...@openindex.io> wrote:
>> >> >> > Hi - I need to process a batch of documents on update, but I cannot
>> seem to find a point where I can hook in and process a list of
>> SolrInputDocuments, neither in UpdateProcessor nor in UpdateHandler.
>> >> >> >
>> >> >> > For now I let it go and implemented it on a per-document basis; it
>> is fast, but I'd prefer batches. Is that possible at all?
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Markus
>> >> >>
>> >>
>>
