Shingle and Query Performance

Lord Khan Han Fri, 26 Aug 2011 14:49:54 -0700

Hi,

We are indexing news documents from various sites. Currently we have
200K docs indexed. The total index size is 36 GB. News items can also have
attachments (PDFs, docs, etc.), so a single document can be large (e.g. 10 MB).


We are using some complex queries with around 30-40 terms per query.
About 70% of these terms are two-word phrases. We combine them with
+ and - to pinpoint exact results.
There is also grouping, dismax, boosting, and term vector highlighting.

Our problem is query time. Currently it is around 6-7 seconds. I know our
queries are a bit heavy, but we want to improve query performance. I believe
we can make it sub-second, but no success so far.

We tried indexing two-word shingles, but that decreases query performance!
We assumed it would speed up phrase searches. What would you suggest?
What are we missing?

(We are using the latest Solr trunk, and the hardware is pretty good: 32 cores
with 32 GB RAM.)

Here is the field definition:

<fieldType name="sh_text" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <!--<filter class="solr.LowerCaseFilterFactory"/>-->
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
outputUnigrams="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <!--<filter class="solr.LowerCaseFilterFactory"/>-->
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
outputUnigrams="true"/>
      </analyzer>
    </fieldType>

and

 <field name="content" type="sh_text" stored="true" indexed="true"
termVectors="true" termPositions="true" termOffsets="true"/>
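
To illustrate what the shingle filter is expected to emit with
maxShingleSize="2" and outputUnigrams="true", here is a rough Python sketch.
It is not Solr code: the function name shingle is made up for illustration,
and it ignores token positions, stop-word gaps, and the other filters in the
chain above.

```python
def shingle(tokens, max_size=2, output_unigrams=True, sep=" "):
    """Roughly mimic the token stream of a shingle filter (illustration only)."""
    out = []
    for i in range(len(tokens)):
        # The single-word token comes first when unigrams are enabled.
        if output_unigrams:
            out.append(tokens[i])
        # Then every shingle of 2..max_size words starting at this token.
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


print(shingle("Solr query performance".split()))
# ['Solr', 'Solr query', 'query', 'query performance', 'performance']
```

So with unigrams kept, each field value yields nearly twice as many indexed
terms as it has words: n unigrams plus n - 1 two-word shingles.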
