Re: Performance improvement for solr faceting on large index

Otis Gospodnetic Sat, 24 Nov 2012 19:47:04 -0800

Hi Pravin,

Those unigrams... how are you using them?  What are the queries like?
I wonder if it's the (probably) massive number of terms in your index
that's the problem.
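
A rough way to see why the term count could balloon, as a back-of-the-envelope sketch rather than a measurement of your index: with maxShingleSize="5" and outputUnigrams="true", a run of n adjacent tokens yields n unigrams plus the shingles of sizes 2 to 5. The function name below is made up for illustration, and it assumes the default minimum shingle size of 2 and counts tokens before RemoveDuplicatesTokenFilterFactory runs.

```python
def shingle_token_count(n_tokens, max_shingle_size=5, output_unigrams=True):
    """Tokens emitted for one run of n_tokens adjacent words.

    Assumes the minimum shingle size is 2 (the default) and counts
    tokens before any duplicate removal.
    """
    smallest = 1 if output_unigrams else 2
    # A run of n tokens contains n - size + 1 shingles of a given size.
    return sum(
        max(0, n_tokens - size + 1)
        for size in range(smallest, max_shingle_size + 1)
    )


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, shingle_token_count(n))
```

Most of the longer shingles will be unique across documents, so the number of distinct terms in the field can end up far larger than the vocabulary of the original text.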


When queries are in flight and your CPU is 100% busy, do a few thread dumps
(kill -3 PID) and look where the threads are.  That will point you in the
right direction.
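
If reading several dumps by eye gets tedious, something like the following could tally where the busy threads are. It is only a sketch: the function name is made up, and it assumes a HotSpot-style dump in which each thread starts with a quoted name, is followed by a "java.lang.Thread.State:" line, and lists frames as "at package.Class.method(...)". Note that kill -3 writes the dump to the JVM's standard output, so the text has to be collected from wherever that is redirected.

```python
import re
from collections import Counter

# Matches a stack frame line such as "\tat org.example.Foo.bar(Foo.java:12)".
FRAME_RE = re.compile(r"^\s+at\s+([\w.$<>]+)\(")


def top_runnable_frames(dump_text):
    """Count the top-most stack frame of every RUNNABLE thread in a dump."""
    counts = Counter()
    want_frame = False  # True while inside a RUNNABLE thread with no frame seen yet
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A quoted thread name starts a new thread entry.
            want_frame = False
        elif "java.lang.Thread.State:" in line:
            want_frame = "RUNNABLE" in line
        elif want_frame:
            match = FRAME_RE.match(line)
            if match:
                counts[match.group(1)] += 1
                want_frame = False  # only the top frame is counted
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python top_frames.py < saved_thread_dumps.txt
    for frame, n in top_runnable_frames(sys.stdin.read()).most_common(10):
        print(n, frame)
```

If the same few frames dominate across dumps taken a few seconds apart, that is most likely where the CPU time is going.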

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html




On Fri, Nov 23, 2012 at 5:14 AM, Pravin Agrawal <
pravin_agra...@persistent.co.in> wrote:

> Thanks Yuval and Otis for the reply.
>
> Yuval: I tried different combinations of facet.method (fc and enum) and
> filterCache size, but there was not much improvement in processing time.
>
> Otis: We plan to move this processing out of Solr in the future, but it
> would be a large code change at this point.
> I know that outputting unigrams can be expensive, but we need to keep them
> :(.
> The Solr server has 128 GB of memory, of which we have assigned 64 GB to
> Solr. We observed that Solr threads use 100% CPU while a request is being
> processed.
> We are trying to split this index further into 4 shards to reduce the index
> size per shard.
>
> A few more questions: since we have a large number of unique terms in our
> index, is facet method fc or enum the better choice? And can a large
> facet.enum.cache.minDf value help?
>
>
> Thanks,
> Pravin Agrawal
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
> Sent: Friday, November 23, 2012 6:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Performance improvement for solr faceting on large index
>
> Hi,
>
> I don't quite follow what you are trying to do, but it almost sounds
> like you may be better off using something other than Solr if all you are
> doing is filtering by site and counting something.
> I see unigrams in what looks like it could be a big field and that's a red
> flag.
> Your index is quite big - how much memory have you got?  Do those queries
> produce a lot of disk IO?  I have a feeling they do. If so, your shards may
> be too large for your hardware.
>
> Otis
> --
> _________________________
> From: Yuval Dotan [yuvaldo...@gmail.com]
> Sent: Thursday, November 22, 2012 7:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Performance improvement for solr faceting on large index
>
> you could always try the fc facet method and maybe increase the filtercache
> size
>
> On Thu, Nov 22, 2012 at 2:53 PM, Pravin Agrawal <
> pravin_agra...@persistent.co.in> wrote:
>
> > Hi All,
> >
> > We are using Solr 3.4 with the following schema fields.
> >
> >
> >
> <schema.xml>---------------------------------------------------------------------------------------
> >
> > <fieldType name="autosuggest_text" class="solr.TextField"
> >             positionIncrementGap="100">
> >             <analyzer type="index">
> >                 <tokenizer class="solr.StandardTokenizerFactory"/>
> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >                 <filter class="solr.ShingleFilterFactory"
> > maxShingleSize="5" outputUnigrams="true"/>
> >                 <filter class="solr.PatternReplaceFilterFactory"
> > pattern="^([0-9. ])*$" replacement=""
> >                     replace="all"/>
> >                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >             </analyzer>
> >             <analyzer type="query">
> >                 <tokenizer class="solr.StandardTokenizerFactory"/>
> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >             </analyzer>
> >         </fieldType>
> >
> > <field name="id" type="string" stored="true" indexed="true"/>
> > <field name="autoSuggestContent" type="autosuggest_text" stored="true"
> > indexed="true" multiValued="true"/>
> >         <copyField source="content" dest="autoSuggestContent"/>
> >         <copyField source="original_title" dest="autoSuggestContent"/>
> >
> > <field name="content" type="text" stored="true" indexed="true"/>
> > <field name="original_title" type="text" stored="true" indexed="true"/>
> > <field name="site" type="site" stored="false" indexed="true"/>
> >
> >
> >
> </schema.xml>---------------------------------------------------------------------------------------
> >
> > The index for the above schema is distributed across two Solr shards, each
> > holding about 1.2 million documents and taking about 195GB on disk.
> >
> > We want to retrieve (site, autoSuggestContent term, frequency of the term)
> > information from our main Solr index above. The site is a field in the
> > document and contains the name of the site to which that document belongs.
> > The terms are retrieved from the multivalued field autoSuggestContent, which
> > is created using shingles from the content and title of the web page.
> >
> > As of now, we are using a facet query to retrieve (term, frequency of term)
> > for each site. Below is a sample query (you may ignore the initial part of
> > the query)
> >
> >
> >
> http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
> >
> > The problem is that as the index has grown, this method has started taking
> > a very long time. It used to take 7 minutes per site with an index size of
> > 0.4 million docs, but it takes around 60-90 minutes for an index size of
> > 2.5 million. At this speed, it will take around 5-6 days to index all 1500
> > sites. We also expect the index to grow with more documents and more
> > sites, so the time to get the above information will increase further.
> >
> > Please let us know if there is a better way to extract (site, term,
> > frequency) information compared to the current method.
> >
> > Thanks,
> > Pravin Agrawal
> >
> >
> >
> >
> > DISCLAIMER
> > ==========
> > This e-mail may contain privileged and confidential information which is
> > the property of Persistent Systems Ltd. It is intended only for the use
> of
> > the individual or entity to which it is addressed. If you are not the
> > intended recipient, you are not authorized to read, retain, copy, print,
> > distribute or use this message. If you have received this communication
> in
> > error, please notify the sender and delete all copies of this message.
> > Persistent Systems Ltd. does not accept any liability for virus infected
> > mails.
> >
>
>
