Hi All,

We are using Solr 3.4 with the following schema fields.

<schema.xml>---------------------------------------------------------------------------------------

<fieldType name="autosuggest_text" class="solr.TextField"
           positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="5"
                outputUnigrams="true"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="^([0-9. ])*$" replacement="" replace="all"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<field name="id" type="string" stored="true" indexed="true"/>
<field name="autoSuggestContent" type="autosuggest_text" stored="true"
       indexed="true" multiValued="true"/>
<copyField source="content" dest="autoSuggestContent"/>
<copyField source="original_title" dest="autoSuggestContent"/>

<field name="content" type="text" stored="true" indexed="true"/>
<field name="original_title" type="text" stored="true" indexed="true"/>
<field name="site" type="site" stored="false" indexed="true"/>

</schema.xml>---------------------------------------------------------------------------------------

The index built on the above schema is distributed across two Solr shards, each 
holding about 1.2 million documents and occupying about 195 GB on disk.

We want to retrieve (site, autoSuggestContent term, frequency of the term) 
information from the main Solr index described above. The site is a field in 
each document and contains the name of the site to which that document belongs. 
The terms are retrieved from the multivalued field autoSuggestContent, which is 
built from shingles of the content and title of the web page.
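
To illustrate what ends up in autoSuggestContent, here is a rough Python sketch 
of shingling with maxShingleSize=5 and outputUnigrams=true. It is only an 
approximation of the index analyzer: the function name shingles and the sample 
title are made up for illustration, whitespace splitting stands in for 
StandardTokenizerFactory, and the PatternReplace and RemoveDuplicates filters 
are not modelled.

```python
def shingles(tokens, max_size=5, output_unigrams=True):
    """Return word n-grams of up to max_size tokens (hypothetical helper).

    Unigrams are included only when output_unigrams is True.
    """
    min_size = 1 if output_unigrams else 2
    terms = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # not enough tokens left for a shingle of this size
            terms.append(" ".join(tokens[start:start + size]))
    return terms


# Crude stand-in for StandardTokenizerFactory + LowerCaseFilterFactory.
sample_title = "Apache Solr Search"  # made-up example value
tokens = sample_title.lower().split()

for term in shingles(tokens):
    print(term)
```

A field value with n tokens produces up to 5n terms this way, so the number of 
distinct terms in autoSuggestContent is likely much larger than in the source 
fields.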

At present, we use a facet query to retrieve (term, frequency of term) for 
each site. Below is a sample query (you may ignore the initial part of the query):

http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
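
For reference, the same request can be assembled per site with a small Python 
sketch. The parameters are copied from the sample query above; the helper name 
facet_url and its mincount argument are hypothetical, and the sketch only builds 
the URL without sending it.

```python
from urllib.parse import urlencode

# Host, port and path are taken from the sample query above.
SOLR_SELECT = "http://localhost:8080/solr/select"


def facet_url(site, mincount=25):
    """Build the per-site facet query URL (hypothetical helper)."""
    params = [
        ("indent", "on"),
        ("q", "*:*"),
        ("fq", "site:" + site),       # restrict to one site
        ("start", 0),
        ("rows", 0),                  # no documents, facet counts only
        ("fl", "id"),
        ("qt", "dismax"),
        ("facet", "true"),
        ("facet.field", "autoSuggestContent"),
        ("facet.mincount", mincount),
        ("facet.limit", -1),          # return all terms
        ("facet.method", "enum"),
        ("facet.sort", "index"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


print(facet_url("www.abc.com"))
```

Note that urlencode percent-encodes the reserved characters, so q=*:* appears 
as q=%2A%3A%2A in the resulting URL.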

The problem is that as the index has grown, this method has become very slow. 
It used to take 7 minutes per site with an index of 0.4 million docs, but it 
now takes around 60-90 minutes with an index of 2.5 million docs. At this 
speed, it will take around 5-6 days to process all 1500 sites. We also expect 
the index to grow with more documents and more sites, so the time needed to get 
the above information will increase further.

Please let us know if there is a better way to extract the (site, term, 
frequency) information than our current method.

Thanks,
Pravin Agrawal



