Re: Shingle and Query Performance

Erik Hatcher Sat, 27 Aug 2011 05:38:31 -0700

Please confirm whether this is caused by grouping.  Turn grouping off; what's
the query time like?
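
One way to run that comparison, as a rough sketch: build the same request twice, once with group=true and once with group=false, and compare the QTime reported in each response header. The endpoint URL and the shortened placeholder query below are assumptions for illustration, not values from the log.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port, and core to your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_url(params, grouping):
    """Return a select URL identical to `params` except for the group flag."""
    merged = dict(params)
    merged["group"] = "true" if grouping else "false"
    # doseq=True turns list values (the two fq filters) into repeated keys.
    return SOLR_SELECT + "?" + urlencode(merged, doseq=True)


# Shortened stand-in for the logged request; the real q has about 40 clauses.
base = {
    "q": '+XXXX -"XXXXX XXXXX"',
    "qt": "search",
    "group.field": "host",
    "group.limit": 5,
    "fq": ["mrank:[0 TO 100]", "word_count:[70 TO *]"],
    "rows": 20,
}

with_grouping = build_url(base, grouping=True)
without_grouping = build_url(base, grouping=False)
print(with_grouping)
print(without_grouping)
```

Fetch both URLs a few times each, since the first hit after a restart is usually slower than later ones.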



On Aug 27, 2011, at 07:27 , Lord Khan Han wrote:

> On the other hand, we couldn't use the cache for queries of the type below. I
> think that is caused by grouping. Anyway, we need to be sub-second without the cache.
> 
> 
> 
> On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han <khanuniver...@gmail.com> wrote:
> 
>> Hi,
>> 
>> Thanks for the reply.
>> 
>> Here is the Solr log capture:
>> 
>> ******
>> 
>> hl.fragsize=100&spellcheck=true&spellcheck.q=XXXXX&group.limit=5&hl.simple.pre=<b>&hl.fl=content&spellcheck.collate=true&wt=javabin&hl=true&rows=20&version=2&fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,category&hl.snippets=3&start=0&q=%2BXXXX+-"XXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXXX"+-XXX+-"XXXXX"+-XXXX+-XXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXX+-"XXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXXX"+-"XXXXXX"+-XXXXXX+-XXXXX+-"XXXXX"+"XXXXX"+"XXXXX"+"XXXXXX"++&group.field=host&hl.simple.post=</b>&group=true&qt=search&fq=mrank:[0+TO+100]&fq=word_count:[70+TO+*]
>> ******
>> 
>> Each XXXX stands for a word. Every quoted phrase "XXXXX" contains two words.
>> 
>> The timing from debugQuery:
>> 
>> <lst name="timing">
>> <double name="time">8654.0</double>
>> <lst name="prepare">
>> <double name="time">16.0</double>
>> <lst name="org.apache.solr.handler.component.QueryComponent">
>> <double name="time">16.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.FacetComponent">
>> <double name="time">0.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>> <double name="time">0.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.HighlightComponent">
>> <double name="time">0.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.StatsComponent">
>> <double name="time">0.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>> <double name="time">0.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.DebugComponent">
>> <double name="time">0.0</double>
>> </lst>
>> </lst>
>> <lst name="process">
>> <double name="time">8638.0</double>
>> <lst name="org.apache.solr.handler.component.QueryComponent">
>> <double name="time">4473.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.FacetComponent">
>> <double name="time">0.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>> <double name="time">0.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.HighlightComponent">
>> <double name="time">42.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.StatsComponent">
>> <double name="time">0.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>> <double name="time">1.0</double>
>> </lst>
>> <lst name="org.apache.solr.handler.component.DebugComponent">
>> <double name="time">4122.0</double>
>> </lst>
>> </lst>
>> </lst>
>> 
>> 
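
A quick check of the numbers above, as a small sketch: the values are copied from the "process" section of the quoted debug output, and the variable names are only illustrative. It adds up the per-component times and shows how much of the total is the DebugComponent itself, a cost that should only be paid when debugQuery=true is set.

```python
# Component timings in milliseconds, copied from the "process" section above.
process_ms = {
    "QueryComponent": 4473.0,
    "FacetComponent": 0.0,
    "MoreLikeThisComponent": 0.0,
    "HighlightComponent": 42.0,
    "StatsComponent": 0.0,
    "SpellCheckComponent": 1.0,
    "DebugComponent": 4122.0,
}

total = sum(process_ms.values())
shares = {name: ms / total for name, ms in process_ms.items()}
without_debug = total - process_ms["DebugComponent"]

# Print the non-zero components, slowest first.
for name, ms in sorted(process_ms.items(), key=lambda kv: -kv[1]):
    if ms:
        print(f"{name:24s} {ms:8.1f} ms  {shares[name]:6.1%}")
print(f"process total: {total:.1f} ms, without DebugComponent: {without_debug:.1f} ms")
```

The components sum to the reported 8638 ms, and roughly half of that is debugging overhead, so the figure to optimize is closer to the 4473 ms spent in QueryComponent.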
>> The funny thing is that if I remove the ShingleFilter from the "sh_text" field
>> below and index normally, the query time is half that of the current shingled
>> index! Shouldn't a shingled index be better for such heavy two-word phrase
>> searches? I am confused.
>> 
>> On the other hand, an off-the-shelf search engine from one of the big
>> companies runs the same query on the same machine in 0.7 to 0.8 secs without
>> a cache. I am confident we can do better in Solr, but I couldn't find the way
>> at the moment.
>> 
>> Thanks for helping.
>> 
>> 
>> 
>> 
>> On Sat, Aug 27, 2011 at 2:46 AM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>> 
>>> 
>>> On Aug 26, 2011, at 17:49 , Lord Khan Han wrote:
>>>> We are indexing news documents from various sites. Currently we have
>>>> 200K docs indexed. Total index size is 36 GB. There are also attachments
>>>> to the news items (PDFs, docs, etc.), so a document can be large (e.g. 10 MB).
>>>> 
>>>> We are using some complex queries that include around 30 to 40 terms per
>>>> query. 70% of these terms are two-word phrases. We combine them with
>>>> + and - to pinpoint exact results.
>>>> There is also grouping, dismax, boosting, and term-vector highlighting.
>>> 
>>> You're using a lot of componentry there, and have complex queries.  We
>>> need more details.
>>> 
>>> Turn on debugQuery=true... what do the timings say for each component?
>>> 
>>>> Our problem is query time. Currently it's around 6-7 secs. I know our query
>>>> is a little bit heavy, but we want to improve query performance. I believe we
>>>> can make it sub-second, but no success at the moment.
>>> 
>>> Please provide an example query or two (perhaps a full line logged from
>>> Solr itself), and then let's see what debugQuery says about your query being
>>> parsed.
>>> 
>>>> We tried using two-word shingle tokens, but that decreases query performance!
>>>> We assumed it would help speed up phrase searches.
>>> 
>>> Again, we'd need to see a parsed query to understand this deeper.
>>> 
>>> Lots of synonym expansion?  A parsed query will tell us.
>>> 
>>> 
>>> 
>>>> (Using the latest Solr trunk; the hardware is pretty good: 32 cores with 32 GB
>>>> of RAM.)
>>>> 
>>>> Here is the field definition:
>>>> 
>>>> <fieldType name="sh_text" class="solr.TextField" positionIncrementGap="100"
>>>> autoGeneratePhraseQueries="true">
>>>>     <analyzer type="index">
>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>       <!--<filter class="solr.LowerCaseFilterFactory"/>-->
>>>>       <filter class="solr.KeywordMarkerFilterFactory"
>>>> protected="protwords.txt"/>
>>>>       <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>>>> outputUnigrams="true"/>
>>>>     </analyzer>
>>>>     <analyzer type="query">
>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>> ignoreCase="true" expand="true"/>
>>>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>       <!--<filter class="solr.LowerCaseFilterFactory"/>-->
>>>>       <filter class="solr.KeywordMarkerFilterFactory"
>>>> protected="protwords.txt"/>
>>>>       <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>>>> outputUnigrams="true"/>
>>>>     </analyzer>
>>>>   </fieldType>
>>>> 
>>>> and
>>>> 
>>>> <field name="content" type="sh_text" stored="true" indexed="true"
>>>> termVectors="true" termPositions="true" termOffsets="true"/>
>>> 
>>> 
>>
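
For reference, here is a simplified model of what a shingle filter configured like the quoted sh_text one (maxShingleSize="2", outputUnigrams="true") emits for a token stream. This is an illustrative sketch only, not Lucene's implementation: it ignores positions, offsets, and the configurable token separator, and the function name is made up.

```python
def shingle(tokens, max_size=2, output_unigrams=True):
    """Return unigrams (optionally) plus word n-grams up to max_size."""
    out = []
    smallest = 1 if output_unigrams else 2
    for i in range(len(tokens)):
        for n in range(smallest, max_size + 1):
            # Only emit an n-gram if it fits inside the token stream.
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


terms = shingle(["solr", "query", "performance"])
print(terms)
```

With unigrams kept, n input tokens become 2n - 1 indexed terms, so the shingled field holds nearly twice as many tokens as the plain one. Whether that accounts for the slower queries is not something this sketch can show; the parsed query from debugQuery is still needed for that.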
