This might be useful. In this scenario you load your content into Solr for
staging and perform your ETL from Solr to Solr:

http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Basically Solr becomes a text processing warehouse.
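
For instance, a sketch of driving such a Solr-to-Solr batch job from SolrJ
(the collection names, fields, and batch size are made up for illustration):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.MapSolrParams;

public class SolrToSolrEtl {
  public static void main(String[] args) throws Exception {
    // Stream every doc out of the staging collection and index the
    // tuples into production, 500 docs per update batch.
    String expr = "update(production, batchSize=500,"
        + " search(staging, q=\"*:*\", fl=\"id,title,body\","
        + " sort=\"id asc\", qt=\"/export\"))";

    Map<String, String> params = new HashMap<>();
    params.put("expr", expr);
    params.put("qt", "/stream");

    SolrStream stream = new SolrStream(
        "http://localhost:8983/solr/staging", new MapSolrParams(params));
    try {
      stream.open();
      Tuple tuple = stream.read();
      while (!tuple.EOF) {   // each tuple reports indexing progress
        tuple = stream.read();
      }
    } finally {
      stream.close();
    }
  }
}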

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 3, 2016 at 5:05 PM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> How big a batch are we talking about?
>
> Because I believe you could accumulate the docs in the first URP in
> the processAdd and then do the batch lookup and actual processing of
> them on processCommit.
>
> They are daisy-chained, so as long as you are holding on to the docs,
> the rest of the URPs don't run.
>
> Obviously you are relying on the commit here to trigger the final call.
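>
> A minimal sketch of such a buffering URP (the remoteLookup call is a
> placeholder, and a real implementation would have to copy the commands,
> since Solr may reuse them):
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.solr.update.AddUpdateCommand;
> import org.apache.solr.update.CommitUpdateCommand;
> import org.apache.solr.update.processor.UpdateRequestProcessor;
>
> public class BatchLookupProcessor extends UpdateRequestProcessor {
>   private final List<AddUpdateCommand> buffer = new ArrayList<>();
>
>   public BatchLookupProcessor(UpdateRequestProcessor next) {
>     super(next);
>   }
>
>   @Override
>   public void processAdd(AddUpdateCommand cmd) {
>     // hold the doc; the rest of the chain doesn't see it yet
>     buffer.add(cmd);
>   }
>
>   @Override
>   public void processCommit(CommitUpdateCommand cmd) throws IOException {
>     remoteLookup(buffer);              // one round trip for the batch
>     for (AddUpdateCommand add : buffer) {
>       super.processAdd(add);           // now release down the chain
>     }
>     buffer.clear();
>     super.processCommit(cmd);
>   }
>
>   private void remoteLookup(List<AddUpdateCommand> batch) {
>     // placeholder for the batched external lookup
>   }
> }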
>
> Or you could do a two-collection sequence: index into the first
> collection, query for whatever you need to batch-lookup, and then do a
> collection-to-collection enhanced copy.
>
> Regards,
>    Alex.
> ----
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 4 November 2016 at 07:35, mike st. john <mstj...@gmail.com> wrote:
> > Maybe introduce a distributed queue such as Apache Ignite, Hazelcast or
> > even Redis. Read from the queue in batches, do your lookup, then index
> > the same batch.
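> >
> > For instance, with Redis via Jedis (the queue name, batch size, and the
> > lookupAndIndex step are all invented for illustration):
> >
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > import redis.clients.jedis.Jedis;
> >
> > public class QueueBatchIndexer {
> >   public static void main(String[] args) {
> >     try (Jedis jedis = new Jedis("localhost", 6379)) {
> >       while (true) {
> >         // drain up to 500 raw docs from the queue
> >         List<String> batch = new ArrayList<>();
> >         String item;
> >         while (batch.size() < 500
> >             && (item = jedis.lpop("docs-to-index")) != null) {
> >           batch.add(item);
> >         }
> >         if (batch.isEmpty()) {
> >           break;
> >         }
> >         // one batched remote lookup, then one batched add to Solr
> >         lookupAndIndex(batch);   // hypothetical helper
> >       }
> >     }
> >   }
> >
> >   private static void lookupAndIndex(List<String> batch) {
> >     // placeholder: enrich the docs, then client.add(docs) in one call
> >   }
> > }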
> >
> > just a thought.
> >
> > Mike St. John.
> >
> > On Nov 3, 2016 3:58 PM, "Erick Erickson" <erickerick...@gmail.com>
> > wrote:
> >
> >> I thought we might be talking past each other...
> >>
> >> I think you're into "roll your own" here. Anything that
> >> accumulated docs for a while, did a batch lookup
> >> on the external system, then passed on the docs
> >> runs the risk of losing docs if the server is abnormally
> >> shut down.
> >>
> >> I guess ideally you'd like to augment the list coming in
> >> rather than the docs once they're removed from the
> >> incoming batch and passed on, but I admit I have no
> >> clue where to do that. Possibly in an update chain? If
> >> so, you'd need to be careful to only augment once
> >> they've reached their final shard leader, or all at once
> >> before distribution to shard leaders.
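> >>
> >> For example, in solrconfig.xml the custom factory could sit after the
> >> distributed processor so the augmentation happens where the doc is
> >> actually indexed (the factory name below is hypothetical):
> >>
> >> <updateRequestProcessorChain name="batch-lookup">
> >>   <processor class="solr.DistributedUpdateProcessorFactory"/>
> >>   <!-- anything below runs after the doc reaches its shard -->
> >>   <processor class="com.example.BatchLookupProcessorFactory"/>
> >>   <processor class="solr.RunUpdateProcessorFactory"/>
> >> </updateRequestProcessorChain>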
> >>
> >> Is the expense for the external lookup doing the actual
> >> lookups or establishing the connection? Would
> >> having some kind of shared connection to the external
> >> source be worthwhile?
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
> >> <markus.jel...@openindex.io> wrote:
> >> > Hi - I believe I did not explain myself well enough.
> >> >
> >> > Getting the data into Solr is not a problem; various sources index
> >> > docs to Solr, all in fine batches as everyone should. The thing is
> >> > that I need to do some preprocessing before it is indexed. Normally,
> >> > UpdateProcessors are the way to go. I've made quite a few of them and
> >> > they work fine.
> >> >
> >> > The problem is, I need to do a remote lookup for each document being
> >> > indexed. Right now, I make an external connection for each doc being
> >> > indexed in the current UpdateProcessor. This is still fast. But the
> >> > remote backend supports batched lookups, which are faster.
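> >> >
> >> > Roughly what the current processAdd does, one round trip per document
> >> > (remoteClient is our own lookup client, shown here only as a sketch):
> >> >
> >> > @Override
> >> > public void processAdd(AddUpdateCommand cmd) throws IOException {
> >> >   SolrInputDocument doc = cmd.getSolrInputDocument();
> >> >   // one remote call per document
> >> >   Map<String, Object> extra =
> >> >       remoteClient.lookup((String) doc.getFieldValue("id"));
> >> >   extra.forEach(doc::setField);
> >> >   super.processAdd(cmd);
> >> > }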
> >> >
> >> > This is why I'd love to be able to buffer documents in an
> >> > UpdateProcessor and, if there are enough, do a remote lookup for all
> >> > of them, do some processing, and let them be indexed.
> >> >
> >> > Thanks,
> >> > Markus
> >> >
> >> >
> >> >
> >> > -----Original message-----
> >> >> From:Erick Erickson <erickerick...@gmail.com>
> >> >> Sent: Thursday 3rd November 2016 19:18
> >> >> To: solr-user <solr-user@lucene.apache.org>
> >> >> Subject: Re: UpdateProcessor as a batch
> >> >>
> >> >> I _thought_ you'd been around long enough to know about the options I
> >> >> mentioned ;).
> >> >>
> >> >> Right. I'd guess you're in UpdateHandler.addDoc, and there's really
> >> >> no batching at that level that I know of. I'm pretty sure that even
> >> >> batches of 1,000 documents from, say, SolrJ go through this method.
> >> >>
> >> >> I don't think there's much to be gained by any batching at this
> >> >> level; it pretty immediately tells Lucene to index the doc.
> >> >>
> >> >> FWIW
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> >> >> <markus.jel...@openindex.io> wrote:
> >> >> > Erick - in this case data can come from anywhere. There is one
> >> >> > piece of code that all incoming documents, regardless of their
> >> >> > origin, pass through: the update handler and update processors of
> >> >> > Solr.
> >> >> >
> >> >> > In my case that is the most convenient point to partially modify
> >> >> > the documents, instead of moving that logic to separate places.
> >> >> >
> >> >> > I've seen the ContentStream in SolrQueryRequest and I probably
> >> >> > could tear incoming data apart and put it back together again, but
> >> >> > that would not be as easy as working with already-deserialized
> >> >> > objects such as SolrInputDocument.
> >> >> >
> >> >> > UpdateHandler doesn't seem to work on a list of documents; it
> >> >> > looks like it works on individual incoming docs, not a whole list.
> >> >> > I've also looked at whether I could buffer a batch in an
> >> >> > UpdateProcessor, work on them, and release them, but that seems
> >> >> > impossible.
> >> >> >
> >> >> > Thanks,
> >> >> > Markus
> >> >> >
> >> >> > -----Original message-----
> >> >> >> From:Erick Erickson <erickerick...@gmail.com>
> >> >> >> Sent: Thursday 3rd November 2016 18:57
> >> >> >> To: solr-user <solr-user@lucene.apache.org>
> >> >> >> Subject: Re: UpdateProcessor as a batch
> >> >> >>
> >> >> >> Markus:
> >> >> >>
> >> >> >> How are you indexing? SolrJ has a client.add(List<SolrInputDocument>)
> >> >> >> form, and post.jar lets you add as many documents as you want in a
> >> >> >> batch....
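> >> >> >>
> >> >> >> E.g., a quick SolrJ sketch (the URL and field names are made up):
> >> >> >>
> >> >> >> import java.util.ArrayList;
> >> >> >> import java.util.List;
> >> >> >>
> >> >> >> import org.apache.solr.client.solrj.SolrClient;
> >> >> >> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> >> >> >> import org.apache.solr.common.SolrInputDocument;
> >> >> >>
> >> >> >> public class BatchAdd {
> >> >> >>   public static void main(String[] args) throws Exception {
> >> >> >>     List<SolrInputDocument> docs = new ArrayList<>();
> >> >> >>     for (int i = 0; i < 1000; i++) {
> >> >> >>       SolrInputDocument doc = new SolrInputDocument();
> >> >> >>       doc.addField("id", Integer.toString(i));
> >> >> >>       doc.addField("title_s", "doc " + i);
> >> >> >>       docs.add(doc);
> >> >> >>     }
> >> >> >>     try (SolrClient client = new HttpSolrClient.Builder()
> >> >> >>         .withBaseSolrUrl("http://localhost:8983/solr/collection1")
> >> >> >>         .build()) {
> >> >> >>       client.add(docs);   // one request for the whole batch
> >> >> >>       client.commit();
> >> >> >>     }
> >> >> >>   }
> >> >> >> }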
> >> >> >>
> >> >> >> Best,
> >> >> >> Erick
> >> >> >>
> >> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
> >> >> >> <markus.jel...@openindex.io> wrote:
> >> >> >> > Hi - I need to process a batch of documents on update, but I
> >> >> >> > cannot seem to find a point where I can hook in and process a
> >> >> >> > list of SolrInputDocuments, not in UpdateProcessor nor in
> >> >> >> > UpdateHandler.
> >> >> >> >
> >> >> >> > For now I let it go and implemented it on a per-document basis;
> >> >> >> > it is fast, but I'd prefer batches. Is that possible at all?
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Markus
> >> >> >>
> >> >>
> >>
>
