Re: Shingle and Query Performance

Erick Erickson Mon, 29 Aug 2011 05:11:34 -0700

Oh, one other thing: have you profiled your machine
to see if you're swapping? How much memory are
you giving your JVM? What is the underlying
hardware setup?


Best
Erick

On Mon, Aug 29, 2011 at 8:09 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> 200K docs and 36G index? It sounds like you're storing
> your documents in the Solr index. In and of itself, that
> shouldn't hurt your query times, *unless* you have
> lazy field loading turned off. Have you checked that
> lazy field loading is enabled?
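>
> For reference, a minimal sketch of where that setting usually lives, assuming
> a stock solrconfig.xml (your file may be laid out differently, and the
> comments are illustrative only):
>
> ```xml
> <!-- solrconfig.xml, inside the <query> section -->
> <query>
>   <!-- Stored fields that were not requested are loaded lazily, so a large
>        stored "content" field is only read when it is actually asked for. -->
>   <enableLazyFieldLoading>true</enableLazyFieldLoading>
> </query>
> ```
>
> With this on, restricting fl to the fields you need should keep Solr from
> reading the large stored fields for every hit.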
>
>
>
> Best
> Erick
>
> On Sun, Aug 28, 2011 at 5:30 AM, Lord Khan Han <khanuniver...@gmail.com> wrote:
>> Another interesting thing: all queries, single-word or multi-word, including
>> phrase queries such as "barack obama", are slower in the shingle
>> configuration. What am I doing wrong? Without shingles, "barack obama" has a
>> query time of 300 ms; with shingles, 780 ms.
>>
>>
>> On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han <khanuniver...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> What is the difference between Solr 3.3 and trunk?
>>> I will try 3.3 and let you know the results.
>>>
>>>
>>> Here is the search handler:
>>>
>>> <requestHandler name="search" class="solr.SearchHandler" default="true">
>>>   <lst name="defaults">
>>>     <str name="echoParams">explicit</str>
>>>     <int name="rows">10</int>
>>>     <!--<str name="fq">category:vv</str>-->
>>>     <str name="fq">mrank:[0 TO 100]</str>
>>>     <str name="echoParams">explicit</str>
>>>     <int name="rows">10</int>
>>>     <str name="defType">edismax</str>
>>>     <!--<str name="qf">title^0.05 url^1.2 content^1.7 m_title^10.0</str>-->
>>>     <str name="qf">title^1.05 url^1.2 content^1.7 m_title^10.0</str>
>>>     <!-- <str name="bf">recip(ee_score,-0.85,1,0.2)</str> -->
>>>     <str name="pf">content^18.0 m_title^5.0</str>
>>>     <int name="ps">1</int>
>>>     <int name="qs">0</int>
>>>     <str name="mm">2&lt;-25%</str>
>>>     <str name="spellcheck">true</str>
>>>     <!--<str name="spellcheck.collate">true</str>   -->
>>>     <str name="spellcheck.count">5</str>
>>>     <str name="spellcheck.dictionary">subobjective</str>
>>>     <str name="spellcheck.onlyMorePopular">false</str>
>>>     <str name="hl.tag.pre">&lt;b&gt;</str>
>>>     <str name="hl.tag.post">&lt;/b&gt;</str>
>>>     <str name="hl.useFastVectorHighlighter">true</str>
>>>   </lst>
>>>
>>>
>>>
>>>
>>> On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>>>
>>>> I'm not sure what the issue could be at this point.   I see you've got
>>>> qt=search - what's the definition of that request handler?
>>>>
>>>> What is the parsed query (from the debugQuery response)?
>>>>
>>>> Have you tried this with Solr 3.3 to see if there's any appreciable
>>>> difference?
>>>>
>>>>        Erik
>>>>
>>>> On Aug 27, 2011, at 09:34 , Lord Khan Han wrote:
>>>>
>>>> > With grouping off, the query time drops from 3567 ms to 1912 ms. Grouping
>>>> > increases the query time and makes the cache useless. But the same config
>>>> > is still faster without shingles.
>>>> >
>>>> > We have a head-to-head test this Wednesday against this commercial search
>>>> > engine, so I am looking for any suggestions.
>>>> >
>>>> >
>>>> >
>>>> > On Sat, Aug 27, 2011 at 3:37 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>>>> >
>>>> >> Please confirm whether this is caused by grouping. With grouping turned
>>>> >> off, what's the query time like?
>>>> >>
>>>> >>
>>>> >> On Aug 27, 2011, at 07:27 , Lord Khan Han wrote:
>>>> >>
>>>> >>> On the other hand, we couldn't use the cache for the types of queries
>>>> >>> below. I think that is caused by grouping. Anyway, we need to be
>>>> >>> sub-second without the cache.
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han <khanuniver...@gmail.com> wrote:
>>>> >>>
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> Thanks for the reply.
>>>> >>>>
>>>> >>>> Here is the Solr log capture:
>>>> >>>>
>>>> >>>> ******
>>>> >>>>
>>>> >>>>
>>>> >>>> hl.fragsize=100&spellcheck=true&spellcheck.q=XXXXX&group.limit=5&hl.simple.pre=<b>&hl.fl=content&spellcheck.collate=true&wt=javabin&hl=true&rows=20&version=2&fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,category&hl.snippets=3&start=0&q=%2BXXXX+-"XXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXXX"+-XXX+-"XXXXX"+-XXXX+-XXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXX+-"XXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXXX"+-"XXXXXX"+-XXXXXX+-XXXXX+-"XXXXX"+"XXXXX"+"XXXXX"+"XXXXXX"++&group.field=host&hl.simple.post=</b>&group=true&qt=search&fq=mrank:[0+TO+100]&fq=word_count:[70+TO+*]
>>>> >>>> ******
>>>> >>>>
>>>> >>>> XXXX stands for the words. Every phrase "XXXXX" contains two words.
>>>> >>>>
>>>> >>>> The timing from the DebugQuery:
>>>> >>>>
>>>> >>>> <lst name="timing">
>>>> >>>> <double name="time">8654.0</double>
>>>> >>>> <lst name="prepare">
>>>> >>>> <double name="time">16.0</double>
>>>> >>>> <lst name="org.apache.solr.handler.component.QueryComponent">
>>>> >>>> <double name="time">16.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.FacetComponent">
>>>> >>>> <double name="time">0.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>>> >>>> <double name="time">0.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.HighlightComponent">
>>>> >>>> <double name="time">0.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.StatsComponent">
>>>> >>>> <double name="time">0.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>>>> >>>> <double name="time">0.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.DebugComponent">
>>>> >>>> <double name="time">0.0</double>
>>>> >>>> </lst>
>>>> >>>> </lst>
>>>> >>>> <lst name="process">
>>>> >>>> <double name="time">8638.0</double>
>>>> >>>> <lst name="org.apache.solr.handler.component.QueryComponent">
>>>> >>>> <double name="time">4473.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.FacetComponent">
>>>> >>>> <double name="time">0.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>>> >>>> <double name="time">0.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.HighlightComponent">
>>>> >>>> <double name="time">42.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.StatsComponent">
>>>> >>>> <double name="time">0.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>>>> >>>> <double name="time">1.0</double>
>>>> >>>> </lst>
>>>> >>>> <lst name="org.apache.solr.handler.component.DebugComponent">
>>>> >>>> <double name="time">4122.0</double>
>>>> >>>> </lst>
>>>> >>>>
>>>> >>>>
>>>> >>>> The funny thing is that if I remove the ShingleFilter from the
>>>> >>>> "sh_text" field below and index normally, the query time is half that
>>>> >>>> of the current shingled index! Shouldn't a shingled index be better for
>>>> >>>> such heavy two-word phrase searches? I am confused.
>>>> >>>>
>>>> >>>> On the other hand, an off-the-shelf search engine from one of the big
>>>> >>>> companies runs the same query on the same machine in 0.7 to 0.8 seconds
>>>> >>>> without a cache. I am confident we can do better in Solr, but I
>>>> >>>> couldn't find the way so far.
>>>> >>>>
>>>> >>>> Thanks for helping.
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Sat, Aug 27, 2011 at 2:46 AM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>>
>>>> >>>>> On Aug 26, 2011, at 17:49 , Lord Khan Han wrote:
>>>> >>>>>> We are indexing news documents from various sites. Currently we have
>>>> >>>>>> 200K docs indexed. Total index size is 36 GB. The news items also have
>>>> >>>>>> attachments (PDFs, docs, etc.), so document size can be large (e.g. 10 MB).
>>>> >>>>>>
>>>> >>>>>> We are using some complex queries with around 30 to 40 terms per
>>>> >>>>>> query. 70% of these terms are two-word phrases. We combine them
>>>> >>>>>> with + and - to pinpoint the exact result.
>>>> >>>>>> There is also grouping, dismax, boosting, and term-vector highlighting.
>>>> >>>>>
>>>> >>>>> You're using a lot of componentry there, and have complex queries. We
>>>> >>>>> need more details.
>>>> >>>>>
>>>> >>>>> Turn on debugQuery=true... what do the timings say for each component?
>>>> >>>>>
>>>> >>>>>> Our problem is query times. Currently they are around 6-7 secs. I know
>>>> >>>>>> our query is a little heavy, but we want to improve query performance.
>>>> >>>>>> I believe we can make it sub-second, but no success so far.
>>>> >>>>>
>>>> >>>>> Please provide an example query or two (perhaps a full line logged from
>>>> >>>>> Solr itself), and then let's see what debugQuery says about your query
>>>> >>>>> being parsed.
>>>> >>>>>
>>>> >>>>>> We tried indexing two-word shingle tokens, but that decreases query
>>>> >>>>>> performance! We assumed it would help speed up phrase searches.
>>>> >>>>>
>>>> >>>>> Again, we'd need to see a parsed query to understand this deeper.
>>>> >>>>>
>>>> >>>>> Lots of synonym expansion?  A parsed query will tell us.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>> (Using the latest Solr trunk; the hardware is pretty good: 32 cores
>>>> >>>>>> with 32 GB RAM.)
>>>> >>>>>>
>>>> >>>>>> Here is the field definition:
>>>> >>>>>>
>>>> >>>>>> <fieldType name="sh_text" class="solr.TextField"
>>>> >>>>>> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>> >>>>>>    <analyzer type="index">
>>>> >>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>> >>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> >>>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>>> >>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>> >>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>> >>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>> >>>>>>      <!--<filter class="solr.LowerCaseFilterFactory"/>-->
>>>> >>>>>>      <filter class="solr.KeywordMarkerFilterFactory"
>>>> >>>>>> protected="protwords.txt"/>
>>>> >>>>>>      <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>>>> >>>>>> outputUnigrams="true"/>
>>>> >>>>>>    </analyzer>
>>>> >>>>>>    <analyzer type="query">
>>>> >>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>> >>>>>>      <filter class="solr.SynonymFilterFactory"
>>>> >>>>>> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>>> >>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> >>>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>>> >>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>> >>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>> >>>>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>> >>>>>>      <!--<filter class="solr.LowerCaseFilterFactory"/>-->
>>>> >>>>>>      <filter class="solr.KeywordMarkerFilterFactory"
>>>> >>>>>> protected="protwords.txt"/>
>>>> >>>>>>      <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>>>> >>>>>> outputUnigrams="true"/>
>>>> >>>>>>    </analyzer>
>>>> >>>>>>  </fieldType>
>>>> >>>>>>
>>>> >>>>>> and
>>>> >>>>>>
>>>> >>>>>> <field name="content" type="sh_text" stored="true" indexed="true"
>>>> >>>>>> termVectors="true" termPositions="true" termOffsets="true"/>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>
>>>> >>
>>>>
>>>>
>>>
>>
>
