Hi,
We are indexing news documents from various sites. Currently we have
200K docs indexed. Total index size is 36 GB. The news items also have
attachments (PDFs, docs, etc.), so a single document can be large (e.g. 10 MB).
We are using some complex queries that include around 30-40 terms per
query. About 70% of these terms are two-word phrases. We combine them
with the + and - operators to pinpoint exact results.
We also use grouping, dismax, boosting, and term vector highlighting.
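To give a rough idea of the setup, here is a simplified sketch of the kind of request handler and query involved. The handler name, the `title` and `site` fields, the boost values, and the query terms are made up for illustration, and I am assuming "term vector highlighting" means the FastVectorHighlighter; only the `content` field comes from our real schema.

```xml
<!-- Hypothetical handler: name, qf fields, boosts and group.field are
     illustrative only, not our actual configuration. -->
<requestHandler name="/news" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- dismax parsing with per-field boosts -->
    <str name="defType">dismax</str>
    <str name="qf">title^2.0 content</str>
    <!-- result grouping -->
    <bool name="group">true</bool>
    <str name="group.field">site</str>
    <!-- term vector based highlighting on the content field -->
    <bool name="hl">true</bool>
    <str name="hl.fl">content</str>
    <bool name="hl.useFastVectorHighlighter">true</bool>
  </lst>
</requestHandler>
<!-- A shortened example of the query shape (real queries have 30 to 40 terms):
     q=+"prime minister" +"interest rate" -"press release" budget -->
```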
Our problem is query time. Currently it is around 6-7 seconds. I know our
queries are a bit heavy, but we want to improve query performance. I believe
we can get it under one second, but we have had no success so far.
We tried indexing two-word shingles, but that actually decreased query
performance! We had assumed it would speed up phrase searches. What would
you suggest? What are we missing?
(We are using the latest Solr trunk, and the hardware is pretty good: 32 cores with 32 GB of RAM.)
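To illustrate what we expected from shingles: as we understand it, the filter turns each pair of adjacent tokens into one extra term, so a two-word phrase should become a single term lookup instead of a positional phrase match. The sample input in the comment is made up, and the token list reflects our understanding of the filter rather than output captured from our index.

```xml
<!-- Illustration only. For the whitespace-tokenized input
         prime minister visit
     this filter should emit the unigrams plus the two-word shingles:
         prime, "prime minister", minister, "minister visit", visit
     so the phrase "prime minister" exists in the index as one term. -->
<filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        outputUnigrams="true"/>
```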
Here is the field definition:
<fieldType name="sh_text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <!--<filter class="solr.LowerCaseFilterFactory"/>-->
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <!--<filter class="solr.LowerCaseFilterFactory"/>-->
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
and
<field name="content" type="sh_text" stored="true" indexed="true"
       termVectors="true" termPositions="true" termOffsets="true"/>