Re: Best Indexing Approaches - To max the throughput

Gili Nachum Tue, 06 Oct 2015 11:43:46 -0700

CloudSolrServer <https://issues.apache.org/jira/browse/SOLR-4816>: beyond
sending documents to the right shard leader, it also does this in *parallel*
(for a batch), employing its own thread pool, with a connection per shard.
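
To make the idea concrete, here is a minimal sketch of splitting one batch by target shard and sending the sub-batches in parallel from a thread pool. It is not the actual CloudSolrServer code: `shard_for` is a hypothetical stand-in for Solr's real document router (which hashes ids differently), and `send_to_leader` is an assumed callable that posts a list of docs to one shard's leader.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def shard_for(doc_id, num_shards):
    # Hypothetical router: a plain CRC32 modulo, NOT Solr's real hashing scheme.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def send_batch_in_parallel(docs, num_shards, send_to_leader):
    """Group one batch by target shard, then send each group concurrently.

    send_to_leader(shard, docs) is an assumed callable supplied by the caller.
    Returns a dict mapping shard number to whatever send_to_leader returned.
    """
    by_shard = defaultdict(list)
    for doc in docs:
        by_shard[shard_for(doc["id"], num_shards)].append(doc)

    # One worker per shard that actually received documents.
    with ThreadPoolExecutor(max_workers=max(1, len(by_shard))) as pool:
        futures = {
            shard: pool.submit(send_to_leader, shard, sub_batch)
            for shard, sub_batch in by_shard.items()
        }
        # result() re-raises any exception thrown while sending to that shard.
        return {shard: future.result() for shard, future in futures.items()}
```

With a real client, `send_to_leader` would hold one connection per shard leader; here it can be any function, which also makes the grouping easy to test.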


On Tue, Oct 6, 2015 at 8:15 PM, Walter Underwood <wun...@wunderwood.org>
wrote:

> This is at Chegg. One of our indexes is textbooks. These are expensive and
> don’t change very often. It is better to keep yesterday’s index than to
> drop a few important books.
>
> We have occasionally had an error that happens with every book, like a new
> field that is not in the Solr schema. If we ignored errors with that, we’d
> have an empty index: delete all, add all (failing), commit.
>
> With the fail fast and rollback, we can catch problems before they mess up
> the index.
>
> Also, to pinpoint isolated problems, if there is an error in the batch, it
> re-submits that batch one at a time, so we get an accurate report of which
> document was rejected. I wrote that same thing back at Netflix, before
> SolrJ.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Oct 6, 2015, at 9:49 AM, Alessandro Benedetti
> > <benedetti.ale...@gmail.com> wrote:
> >
> > Hi Walter,
> > can you explain your use case better?
> > You index a batch of e-commerce products (Solr documents); if one fails,
> > you want to stop and invalidate the entire batch (using the almost never
> > used Solr rollback, or manual deletion?)
> > And then log the exception indexing size.
> > To then re-index the whole batch of docs?
> >
> > In this scenario, would the ConcurrentUpdateSolrClient not be ideal?
> > Just curious.
> >
> > Cheers
> >
> > On 6 October 2015 at 17:29, Walter Underwood <wun...@wunderwood.org>
> > wrote:
> >
> >> It depends on the document. In an e-commerce search, you might want to
> >> fail immediately and be notified. That is what we do: fail, rollback,
> >> and notify.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
> >>> On Oct 6, 2015, at 7:58 AM, Alessandro Benedetti
> >>> <benedetti.ale...@gmail.com> wrote:
> >>>
> >>> Mmm, one broken document in a batch should not break the entire
> >>> batch, right (whatever approach is used)?
> >>> Are you referring to the fact that you want to programmatically
> >>> re-index the broken docs?
> >>>
> >>> It would be interesting to return the ids of the broken docs along with
> >>> the Solr update response!
> >>>
> >>> Cheers
> >>>
> >>>
> >>> On 6 October 2015 at 15:30, Bill Dueber <b...@dueber.com> wrote:
> >>>
> >>>> Just to add... my informal tests show that batching has waaaaay more
> >>>> effect than SolrJ vs. JSON.
> >>>>
> >>>> I haven't looked at CUSC (ConcurrentUpdateSolrClient) in a while; last
> >>>> time I looked it was impossible to do anything smart about error
> >>>> handling, so check that out before you get too deeply into it. We use a
> >>>> strategy of sending a batch of JSON documents and, if it returns an
> >>>> error, sending each record one at a time until we find the bad one and
> >>>> can log something useful.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Oct 5, 2015 at 12:07 PM, Alessandro Benedetti
> >>>> <benedetti.ale...@gmail.com> wrote:
> >>>>
> >>>>> Thanks Erick,
> >>>>> you confirmed my impressions!
> >>>>> Thank you very much for the insights; any other opinion is welcome :)
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> 2015-10-05 14:55 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
> >>>>>
> >>>>>> SolrJ tends to be faster for several reasons, not the least of which
> >>>>>> is that it sends packets to Solr in a more efficient binary format.
> >>>>>>
> >>>>>> Batching is critical. I did some rough tests using SolrJ and sending
> >>>>>> docs one at a time gave a throughput of < 400 docs/second.
> >>>>>> Sending 10 gave 2,300 or so. Sending 100 at a time gave
> >>>>>> over 5,300 docs/second. Curiously, 1,000 at a time gave only
> >>>>>> marginal improvement over 100. This was with a single thread.
> >>>>>> YMMV of course.
> >>>>>>
> >>>>>> CloudSolrClient is definitely the better way to go with SolrCloud,
> >>>>>> it routes the docs to the correct leader instead of having the
> >>>>>> node you send the docs to do the routing.
> >>>>>>
> >>>>>> Best,
> >>>>>> Erick
> >>>>>>
> >>>>>> On Mon, Oct 5, 2015 at 4:57 AM, Alessandro Benedetti
> >>>>>> <abenede...@apache.org> wrote:
> >>>>>>> I was doing some studies and analysis, and was wondering which
> >>>>>>> approach, in your opinion, is the best one for indexing into Solr
> >>>>>>> to reach the best possible throughput.
> >>>>>>> I know that a lot of factors affect indexing time, so let's focus
> >>>>>>> only on the feeding approach.
> >>>>>>> Let's isolate different scenarios:
> >>>>>>>
> >>>>>>> *Single Solr Infrastructure*
> >>>>>>>
> >>>>>>> 1) XML/JSON batch request to /update IndexHandler (xml/json)
> >>>>>>>
> >>>>>>> 2) SolrJ ConcurrentUpdateSolrClient (javabin)
> >>>>>>> I was thinking this would be the fastest approach for a
> >>>>>>> multi-threaded indexing application, posting a batch of docs per
> >>>>>>> request if possible.
> >>>>>>>
> >>>>>>> *Solr Cloud*
> >>>>>>>
> >>>>>>> 1) XML/JSON batch request to /update IndexHandler (xml/json)
> >>>>>>>
> >>>>>>> 2) SolrJ ConcurrentUpdateSolrClient (javabin)
> >>>>>>>
> >>>>>>> 3) CloudSolrClient (javabin)
> >>>>>>> It seems the best approach according to these improvements [1]
> >>>>>>>
> >>>>>>> What are your opinions?
> >>>>>>>
> >>>>>>> A bonus observation would be about using some Map/Reduce big data
> >>>>>>> indexer, but let's assume we don't have a big cluster of CPUs, just
> >>>>>>> the average indexer server.
> >>>>>>>
> >>>>>>>
> >>>>>>> [1] https://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
> >>>>>>>
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> --------------------------
> >>>>>>>
> >>>>>>> Benedetti Alessandro
> >>>>>>> Visiting card : http://about.me/alessandro_benedetti
> >>>>>>>
> >>>>>>> "Tyger, tyger burning bright
> >>>>>>> In the forests of the night,
> >>>>>>> What immortal hand or eye
> >>>>>>> Could frame thy fearful symmetry?"
> >>>>>>>
> >>>>>>> William Blake - Songs of Experience -1794 England
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Bill Dueber
> >>>> Library Systems Programmer
> >>>> University of Michigan Library
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>
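
For reference, the batch-then-retry strategy described in the quoted messages (send a batch; if it errors, resubmit that batch one document at a time to find the rejected one) can be sketched roughly as follows. This is only an illustration under assumptions, not anyone's production code: `send` is a hypothetical callable that posts a list of docs and raises on rejection, and the sketch assumes that re-sending a document that was already accepted is harmless (for example because adds overwrite by unique key).

```python
def index_with_fallback(docs, send, batch_size=100):
    """Send docs in batches; if a batch fails, resend it one doc at a time.

    send(list_of_docs) is an assumed callable that raises an exception when
    the request is rejected. Returns a list of (doc, error) pairs for the
    documents that were rejected individually.
    """
    rejected = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        try:
            send(batch)  # fast path: the whole batch in one request
        except Exception:
            # Slow path: isolate the offending document(s) in this batch.
            for doc in batch:
                try:
                    send([doc])
                except Exception as err:
                    rejected.append((doc, err))
    return rejected
```

A fail-fast caller can treat a non-empty result as the signal to roll back and notify instead of committing; a more lenient caller can just log the rejected documents and carry on.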
