Hi Pravin, Those unigrams... how are you using them? What are the queries like? I wonder if it's the (probably) massive number of terms in your index that's the problem.
When queries are in flight and your CPU is 100% busy, do a few thread dumps (kill -3 PID) and look at where the threads are. That will point you in the right direction.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

On Fri, Nov 23, 2012 at 5:14 AM, Pravin Agrawal <pravin_agra...@persistent.co.in> wrote:

> Thanks Yuval and Otis for the reply.
>
> Yuval: I tried different combinations of facet.method (fc and enum) and
> filtercache size, but there was not much improvement in the processing time.
>
> Otis: We plan to move this processing out of Solr in the future, but it
> would be a large code change at this point in time.
> I know that outputting unigrams can be expensive, but we need to keep them
> :(.
> The Solr server we are using has 128GB of memory, out of which we have
> assigned 64GB to Solr. We observed that Solr threads use 100% CPU while a
> request is being processed.
> We are trying to divide this index further into 4 shards to reduce the index
> size per shard.
>
> A few more questions: since we have a large number of unique terms in our
> index, is facet method fc better, or enum? And can a large
> facet.enum.cache.minDf value help?
>
> Thanks,
> Pravin Agrawal
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
> Sent: Friday, November 23, 2012 6:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Performance improvement for solr faceting on large index
>
> Hi,
>
> I don't quite follow what you are trying to do, but it almost sounds
> like you may be better off using something other than Solr if all you are
> doing is filtering by site and counting something.
> I see unigrams in what looks like it could be a big field and that's a red
> flag.
> Your index is quite big - how much memory have you got? Do those queries
> produce a lot of disk IO? I have a feeling they do.
> If so, your shards may be too large for your hardware.
>
> Otis
> --
> _________________________
> From: Yuval Dotan [yuvaldo...@gmail.com]
> Sent: Thursday, November 22, 2012 7:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Performance improvement for solr faceting on large index
>
> you could always try the fc facet method and maybe increase the filtercache
> size
>
> On Thu, Nov 22, 2012 at 2:53 PM, Pravin Agrawal <
> pravin_agra...@persistent.co.in> wrote:
>
> > Hi All,
> >
> > We are using Solr 3.4 with the following schema fields.
> >
> > <schema.xml>---------------------------------------------------------------------------------------
> >
> > <fieldType name="autosuggest_text" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.ShingleFilterFactory"
> >             maxShingleSize="5" outputUnigrams="true"/>
> >     <filter class="solr.PatternReplaceFilterFactory"
> >             pattern="^([0-9. ])*$" replacement=""
> >             replace="all"/>
> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > <field name="id" type="string" stored="true" indexed="true"/>
> > <field name="autoSuggestContent" type="autosuggest_text" stored="true"
> >        indexed="true" multiValued="true"/>
> > <copyField source="content" dest="autoSuggestContent"/>
> > <copyField source="original_title" dest="autoSuggestContent"/>
> >
> > <field name="content" type="text" stored="true" indexed="true"/>
> > <field name="original_title" type="text" stored="true" indexed="true"/>
> > <field name="site" type="site" stored="false" indexed="true"/>
> >
> > </schema.xml>---------------------------------------------------------------------------------------
> >
> > The index on the above schema is distributed across two Solr shards, each
> > holding about 1.2 million docs and taking about 195GB on disk.
> >
> > We want to retrieve (site, autoSuggestContent term, frequency of the term)
> > information from our main Solr index above. The site is a field in the
> > document and contains the name of the site to which that document belongs.
> > The terms are retrieved from the multivalued field autoSuggestContent, which
> > is created using shingles from the content and title of the web page.
> >
> > As of now, we are using a facet query to retrieve (term, frequency of term)
> > for each site. Below is a sample query (you may ignore the initial part of
> > the query):
> >
> > http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
> >
> > The problem is that, as the index size increases, this method has started
> > taking a huge amount of time.
> > It used to take 7 minutes per site with an index size of
> > 0.4 million docs but takes around 60-90 minutes for an index size of 2.5
> > million docs. At this speed, it will take around 5-6 days to index the
> > complete 1500 sites. Also, we are expecting the index size to grow with more
> > documents and more sites, and as such the time to get the above information
> > will increase further.
> >
> > Please let us know if there is any better way to extract (site, term,
> > frequency) information compared to the current method.
> >
> > Thanks,
> > Pravin Agrawal
> >
> > DISCLAIMER
> > ==========
> > This e-mail may contain privileged and confidential information which is
> > the property of Persistent Systems Ltd. It is intended only for the use of
> > the individual or entity to which it is addressed. If you are not the
> > intended recipient, you are not authorized to read, retain, copy, print,
> > distribute or use this message. If you have received this communication in
> > error, please notify the sender and delete all copies of this message.
> > Persistent Systems Ltd. does not accept any liability for virus infected
> > mails.
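
For anyone who wants to compare the two facet methods asked about in this thread, here is a minimal Python sketch (not from any of the posters) that rebuilds the sample facet request so that facet.method and facet.enum.cache.minDf can be varied from run to run. The function name build_facet_url and its arguments are hypothetical. The host, field name and parameter values are copied from the sample query, and the indent, fl and qt parameters are left out on the assumption that they are the "initial part" the poster said could be ignored.

```python
from urllib.parse import urlencode

# Base URL copied from the sample query in the thread.
SOLR_SELECT_URL = "http://localhost:8080/solr/select"


def build_facet_url(site, method="enum", min_df=None, mincount=25):
    """Build the per-site facet request URL (hypothetical helper).

    method is "enum" or "fc". min_df, when given, is sent as
    facet.enum.cache.minDf, which only matters with facet.method=enum.
    """
    if method not in ("enum", "fc"):
        raise ValueError("method must be 'enum' or 'fc'")
    params = [
        ("q", "*:*"),
        ("fq", "site:" + site),
        ("start", 0),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", "autoSuggestContent"),
        ("facet.mincount", mincount),
        ("facet.limit", -1),
        ("facet.method", method),
        ("facet.sort", "index"),
    ]
    if min_df is not None:
        params.append(("facet.enum.cache.minDf", min_df))
    # urlencode percent-escapes the values, so "*:*" and "site:..." are safe.
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_facet_url("www.abc.com", method="fc"))
    print(build_facet_url("www.abc.com", method="enum", min_df=1000))
```

Timing the same site with each URL would give a like-for-like comparison; the min_df value of 1000 in the usage lines is only a placeholder, not a recommendation.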