Re: Parallel optimize of index on SolrCloud.

Timothy Potter Wed, 09 Jul 2014 07:50:30 -0700

Hi Modassar,

Have you tried hitting the cores for each replica directly (instead of
using the collection)? i.e. if you had col_shard1_replica1 on node1,
then send the optimize command to that core URL directly:


curl -i -v "http://host:port/solr/col_shard1_replica1/update" \
  -H 'Content-type:application/xml' \
  --data-binary "<optimize/>"

I haven't tried this myself, but it might work ;-)
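
For driving this from Java rather than curl, here is a minimal, untested
sketch of the same idea: send one <optimize/> POST per core URL and do not
wait for one to finish before starting the next. It uses the JDK 11+
java.net.http client instead of SolrJ. The class name ParallelOptimize and
the core URLs passed on the command line are placeholders (assumptions, not
anything from your setup), and whether Solr then actually merges in parallel
is exactly the part that still needs to be verified.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical helper class; the name is illustrative only.
public class ParallelOptimize {

    // Mirrors the curl command: POST "<optimize/>" to one core's /update handler.
    static HttpRequest buildOptimizeRequest(String coreUrl) {
        return HttpRequest.newBuilder(URI.create(coreUrl + "/update"))
                .header("Content-Type", "application/xml")
                .POST(HttpRequest.BodyPublishers.ofString("<optimize/>"))
                .build();
    }

    // Starts every request before waiting on any of them, then returns the
    // HTTP status codes in the same order as coreUrls. join() throws a
    // CompletionException if a request fails at the transport level.
    static List<Integer> optimizeAll(HttpClient client, List<String> coreUrls) {
        List<CompletableFuture<HttpResponse<String>>> pending = new ArrayList<>();
        for (String coreUrl : coreUrls) {
            pending.add(client.sendAsync(buildOptimizeRequest(coreUrl),
                    HttpResponse.BodyHandlers.ofString()));
        }
        List<Integer> statuses = new ArrayList<>();
        for (CompletableFuture<HttpResponse<String>> response : pending) {
            statuses.add(response.join().statusCode());
        }
        return statuses;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java ParallelOptimize <coreUrl> [<coreUrl> ...]");
            return;
        }
        List<Integer> statuses = optimizeAll(HttpClient.newHttpClient(), List.of(args));
        for (int i = 0; i < args.length; i++) {
            System.out.println(args[i] + " -> HTTP " + statuses.get(i));
        }
    }
}
```

You would pass one core URL per shard, e.g.
http://host:port/solr/col_shard1_replica1, with the host, port and core
names taken from your own cluster.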

Tim

On Wed, Jul 9, 2014 at 12:59 AM, Modassar Ather <modather1...@gmail.com> wrote:
> Hi All,
>
> Thanks for your kind suggestions and inputs.
>
> We have been going the optimize way and it has helped. Testing and
> benchmarking around memory and performance have already been done.
> While optimizing, we see scope for improvement by running it in
> parallel, so kindly suggest how that can be achieved.
>
> Thanks,
> Modassar
>
>
> On Wed, Jul 9, 2014 at 11:48 AM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
>> Hi Walter,
>>
>> I wonder why you think SolrCloud isn't necessary if you're indexing once
>> per week. Isn't the automatic failover and auto-sharding still useful? One
>> can also do custom sharding with SolrCloud if necessary.
>>
>>
>> On Wed, Jul 9, 2014 at 11:38 AM, Walter Underwood <wun...@wunderwood.org>
>> wrote:
>>
>> > More memory or faster disks will make a much bigger improvement than a
>> > forced merge.
>> >
>> > What are you measuring? If it is average query time, that is not a good
>> > measure. Look at 90th or 95th percentile. Test with queries from logs.
>> >
>> > No user can see a 10% or 20% difference. If your managers are watching
>> > that, they are watching the wrong thing.
>> >
>> > If you are indexing once per week, you don't really need the complexity
>> > of Solr Cloud. You can do manual sharding.
>> >
>> > wunder
>> >
>> > On Jul 8, 2014, at 10:55 PM, Modassar Ather <modather1...@gmail.com>
>> > wrote:
>> >
>> > > Our index has almost 100M documents running on SolrCloud of 3 shards
>> > > and each shard has an index size of about 700GB (for the record, we are
>> > > not using stored fields - our documents are pretty large). We perform a
>> > > full indexing every weekend and during the week there are no updates
>> > > made to the index. Most of the queries that we run are pretty complex
>> > > with hundreds of terms using PhraseQuery, BooleanQuery, SpanQuery,
>> > > Wildcards, boosts etc. and take many minutes to execute. A difference
>> > > of 10-20% is also a big advantage for us.
>> > >
>> > > We have been optimizing the index after indexing for years and it has
>> > > worked well for us. Every once in a while, we upgrade Solr to the
>> > > latest version and try without optimizing so that we can save the many
>> > > hours it takes to optimize such a huge index, but it does not work well.
>> > >
>> > > Kindly provide your suggestion.
>> > >
>> > > Thanks,
>> > > Modassar
>> > >
>> > >
>> > > On Wed, Jul 9, 2014 at 10:47 AM, Walter Underwood <wun...@wunderwood.org>
>> > > wrote:
>> > >
>> > >> I seriously doubt that you are required to force merge.
>> > >>
>> > >> How much improvement? And is the big performance cost also OK?
>> > >>
>> > >> I have worked on search engines that do automatic merges and offer
>> > >> forced merges for over fifteen years. For all that time, forced merges
>> > >> have usually caused problems.
>> > >>
>> > >> Stop doing forced merges.
>> > >>
>> > >> wunder
>> > >>
>> > >> On Jul 8, 2014, at 10:09 PM, Modassar Ather <modather1...@gmail.com>
>> > >> wrote:
>> > >>
>> > >>> Thanks Walter for your inputs.
>> > >>>
>> > >>> Our use case and performance benchmark require us to invoke optimize.
>> > >>>
>> > >>> Here we see a chance of improving the performance of optimize() if it
>> > >>> is invoked in parallel.
>> > >>> I found that if *distrib=false* is used, the optimization will happen
>> > >>> in parallel.
>> > >>>
>> > >>> But I could not find a way to set it using
>> > >>> HttpSolrServer/CloudSolrServer.
>> > >>> Also, the parameter setting given in my mail above does not seem to
>> > >>> work.
>> > >>>
>> > >>> Please let me know in what ways I can achieve the parallel optimize on
>> > >>> SolrCloud.
>> > >>>
>> > >>> Thanks,
>> > >>> Modassar
>> > >>>
>> > >>> On Tue, Jul 8, 2014 at 7:53 PM, Walter Underwood <wun...@wunderwood.org>
>> > >>> wrote:
>> > >>>
>> > >>>> You probably do not need to force merge (mistakenly called "optimize")
>> > >>>> your index.
>> > >>>>
>> > >>>> Solr does automatic merges, which work just fine.
>> > >>>>
>> > >>>> There are only a few situations where a forced merge is even a good
>> > >>>> idea. The most common one is a replicated (non-cloud) setup with a
>> > >>>> full reindex every night.
>> > >>>>
>> > >>>> If you need Solr Cloud, I cannot think of a situation where you would
>> > >>>> want a forced merge.
>> > >>>>
>> > >>>> wunder
>> > >>>>
>> > >>>> On Jul 8, 2014, at 2:01 AM, Modassar Ather <modather1...@gmail.com>
>> > >>>> wrote:
>> > >>>>
>> > >>>>> Hi,
>> > >>>>>
>> > >>>>> Need to optimize index created using CloudSolrServer APIs under
>> > >>>>> SolrCloud setup of 3 instances on separate machines. Currently it
>> > >>>>> optimizes sequentially if I invoke cloudSolrServer.optimize().
>> > >>>>>
>> > >>>>> To make it parallel I tried making three separate HttpSolrServer
>> > >>>>> instances and invoked httpSolrServer.optimize() on them in parallel,
>> > >>>>> but still it seems to be doing optimization sequentially.
>> > >>>>>
>> > >>>>> I tried invoking optimize directly using HttpPost with the following
>> > >>>>> URL and parameters but still it seems to be sequential.
>> > >>>>> *URL* : http://host:port/solr/collection/update
>> > >>>>>
>> > >>>>> *Parameters*:
>> > >>>>> params.add(new BasicNameValuePair("optimize", "true"));
>> > >>>>> params.add(new BasicNameValuePair("maxSegments", "1"));
>> > >>>>> params.add(new BasicNameValuePair("waitFlush", "true"));
>> > >>>>> params.add(new BasicNameValuePair("distrib", "false"));
>> > >>>>>
>> > >>>>> Kindly provide your suggestion and help.
>> > >>>>>
>> > >>>>> Regards,
>> > >>>>> Modassar
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>
>> > >> --
>> > >> Walter Underwood
>> > >> wun...@wunderwood.org
>> > >>
>> > >>
>> > >>
>> > >>
>> >
>> > --
>> > Walter Underwood
>> > wun...@wunderwood.org
>> >
>> >
>> >
>> >
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
