RE: Performance improvement for solr faceting on large index

Pravin Agrawal Fri, 23 Nov 2012 02:14:59 -0800

Thanks Yuval and Otis for the reply.

Yuval: I tried different combinations of facet.method (fc and enum) and 
filterCache size, but there was not much improvement in the processing time.


Otis: We plan to move this processing out of Solr in the future, but it would 
be a large code change at this point in time.
I know that outputting unigrams can be expensive, but we need to keep them :(.
The Solr server has 128 GB of memory, of which we have assigned 64 GB to Solr. 
We observed that the Solr threads use 100% CPU while a request is being 
processed.
We are trying to split this index further into 4 shards to reduce the index 
size per shard.

I have a few more questions. We have a large number of unique terms in 
our index, so is facet.method fc or enum the better choice? And
can a large facet.enum.cache.minDf value help?
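
For concreteness, the request with that parameter added could be built roughly as below. This is only a sketch: build_facet_url is a hypothetical helper name, the minDf value of 1000 is an untested placeholder, and the remaining parameters are taken from the sample query quoted further down.

```python
from urllib.parse import urlencode


def build_facet_url(site, min_df, base="http://localhost:8080/solr/select"):
    """Build the per-site facet request with facet.enum.cache.minDf set."""
    params = [
        ("q", "*:*"),
        ("fq", "site:" + site),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", "autoSuggestContent"),
        ("facet.mincount", 25),
        ("facet.limit", -1),
        ("facet.sort", "index"),
        ("facet.method", "enum"),
        # Terms whose document frequency is below this value are counted
        # without going through the filterCache (placeholder value).
        ("facet.enum.cache.minDf", min_df),
    ]
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_facet_url("www.abc.com", 1000))
```

The parameter only applies when facet.method=enum is used, which is why the method is set explicitly here.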


Thanks,
Pravin Agrawal

-----Original Message-----
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
Sent: Friday, November 23, 2012 6:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Performance improvement for solr faceting on large index

Hi,

I don't quite follow what you are trying to do, but it almost sounds
like you may be better off using something other than Solr if all you are
doing is filtering by site and counting something.
I see unigrams in what looks like it could be a big field and that's a red
flag.
Your index is quite big - how much memory have you got?  Do those queries
produce a lot of disk IO? I have a feeling they do. If so, your shards may
be too large for your hardware.

Otis
--
_________________________
From: Yuval Dotan [yuvaldo...@gmail.com]
Sent: Thursday, November 22, 2012 7:34 PM
To: solr-user@lucene.apache.org
Subject: Re: Performance improvement for solr faceting on large index

You could always try the fc facet method and maybe increase the filterCache
size.

On Thu, Nov 22, 2012 at 2:53 PM, Pravin Agrawal <
pravin_agra...@persistent.co.in> wrote:

> Hi All,
>
> We are using Solr 3.4 with the following schema fields.
>
>
> <schema.xml>---------------------------------------------------------------------------------------
>
> <fieldType name="autosuggest_text" class="solr.TextField"
>             positionIncrementGap="100">
>             <analyzer type="index">
>                 <tokenizer class="solr.StandardTokenizerFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.ShingleFilterFactory"
> maxShingleSize="5" outputUnigrams="true"/>
>                 <filter class="solr.PatternReplaceFilterFactory"
> pattern="^([0-9. ])*$" replacement=""
>                     replace="all"/>
>                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>             </analyzer>
>             <analyzer type="query">
>                 <tokenizer class="solr.StandardTokenizerFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>             </analyzer>
>         </fieldType>
>
> <field name="id" type="string" stored="true" indexed="true"/>
> <field name="autoSuggestContent" type="autosuggest_text" stored="true"
> indexed="true" multiValued="true"/>
>         <copyField source="content" dest="autoSuggestContent"/>
>         <copyField source="original_title" dest="autoSuggestContent"/>
>
> <field name="content" type="text" stored="true" indexed="true"/>
> <field name="original_title" type="text" stored="true" indexed="true"/>
> <field name="site" type="site" stored="false" indexed="true"/>
>
>
> </schema.xml>---------------------------------------------------------------------------------------
>
> The index on the above schema is distributed across two Solr shards, each
> holding about 1.2 million documents and taking about 195GB on disk.
>
> We want to retrieve (site, autoSuggestContent term, frequency of the term)
> information from our main Solr index above. The site is a field in the document
> and contains the name of the site to which that document belongs. The terms are
> retrieved from the multivalued field autoSuggestContent, which is created using
> shingles from the content and title of the web page.
>
> As of now, we are using a facet query to retrieve (term, frequency of term)
> for each site. Below is a sample query (you may ignore the initial part of
> the query):
>
>
> http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
>
> The problem is that as the index has grown, this method has started
> taking a very long time. It used to take 7 minutes per site with an index of
> 0.4 million docs but takes around 60-90 minutes with an index of 2.5
> million. At this speed, it will take around 5-6 days to index the complete
> set of 1500 sites. We also expect the index to grow with more
> documents and more sites, so the time to get the above information will
> increase further.
>
> Please let us know if there is a better way to extract (site, term,
> frequency) information compared to the current method.
>
> Thanks,
> Pravin Agrawal
>
>
>
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>

