Re: Shingle and Query Performance

Erick Erickson Mon, 29 Aug 2011 05:10:04 -0700

200K docs and a 36G index? It sounds like you're storing
your documents in the Solr index. In and of itself, that
shouldn't hurt your query times, *unless* you have
lazy field loading turned off. Have you checked that
lazy field loading is enabled?
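
For reference, a minimal sketch of where that setting lives, assuming a
stock solrconfig.xml (the surrounding <query> section is taken from the
example config, not from your files):

  <query>
    <!-- stored fields that were not requested are loaded only on demand -->
    <enableLazyFieldLoading>true</enableLazyFieldLoading>
  </query>

With it on, stored fields that are not in your fl list (a large stored
content field, for instance) should not have to be read for every hit.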




Best
Erick

On Sun, Aug 28, 2011 at 5:30 AM, Lord Khan Han <khanuniver...@gmail.com> wrote:
> Another interesting thing: all queries of one or more words, including
> phrase queries such as "barack obama", are slower in the shingle configuration.
> What am I doing wrong? Without shingles, "barack obama" has a query time of
> 300 ms; with shingles, 780 ms.
>
>
> On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han <khanuniver...@gmail.com> wrote:
>
>> Hi,
>>
>> What is the difference between Solr 3.3 and trunk?
>> I will try 3.3 and let you know the results.
>>
>>
>> Here is the search handler:
>>
>> <requestHandler name="search" class="solr.SearchHandler" default="true">
>>   <lst name="defaults">
>>     <str name="echoParams">explicit</str>
>>     <int name="rows">10</int>
>>     <!--<str name="fq">category:vv</str>-->
>>     <str name="fq">mrank:[0 TO 100]</str>
>>     <str name="echoParams">explicit</str>
>>     <int name="rows">10</int>
>>     <str name="defType">edismax</str>
>>     <!--<str name="qf">title^0.05 url^1.2 content^1.7 m_title^10.0</str>-->
>>     <str name="qf">title^1.05 url^1.2 content^1.7 m_title^10.0</str>
>>     <!-- <str name="bf">recip(ee_score,-0.85,1,0.2)</str> -->
>>     <str name="pf">content^18.0 m_title^5.0</str>
>>     <int name="ps">1</int>
>>     <int name="qs">0</int>
>>     <str name="mm">2&lt;-25%</str>
>>     <str name="spellcheck">true</str>
>>     <!--<str name="spellcheck.collate">true</str>   -->
>>     <str name="spellcheck.count">5</str>
>>     <str name="spellcheck.dictionary">subobjective</str>
>>     <str name="spellcheck.onlyMorePopular">false</str>
>>     <str name="hl.tag.pre">&lt;b&gt;</str>
>>     <str name="hl.tag.post">&lt;/b&gt;</str>
>>     <str name="hl.useFastVectorHighlighter">true</str>
>>   </lst>
>>
>>
>>
>>
>> On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>>
>>> I'm not sure what the issue could be at this point.   I see you've got
>>> qt=search - what's the definition of that request handler?
>>>
>>> What is the parsed query (from the debugQuery response)?
>>>
>>> Have you tried this with Solr 3.3 to see if there's any appreciable
>>> difference?
>>>
>>>        Erik
>>>
>>> On Aug 27, 2011, at 09:34 , Lord Khan Han wrote:
>>>
>>> > With grouping off, the query time drops from 3567 ms to 1912 ms. Grouping
>>> > increases the query time and makes the cache useless. But the same config
>>> > is still faster without shingles.
>>> >
>>> > We have a head-to-head test this Wednesday against this commercial search
>>> > engine, so I am looking for any suggestions.
>>> >
>>> >
>>> >
>>> > On Sat, Aug 27, 2011 at 3:37 PM, Erik Hatcher <erik.hatc...@gmail.com>
>>> > wrote:
>>> >
>>> >> Please confirm whether this is caused by grouping. Turn grouping off;
>>> >> what's the query time like?
>>> >>
>>> >>
>>> >> On Aug 27, 2011, at 07:27 , Lord Khan Han wrote:
>>> >>
>>> >>> On the other hand, we couldn't use the cache for the types of queries
>>> >>> below. I think that's caused by grouping. Anyway, we need to be
>>> >>> sub-second without the cache.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han <khanuniver...@gmail.com>
>>> >>> wrote:
>>> >>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> Thanks for the reply.
>>> >>>>
>>> >>>> Here is the Solr log capture:
>>> >>>>
>>> >>>> ******
>>> >>>>
>>> >>>>
>>> >>
>>> hl.fragsize=100&spellcheck=true&spellcheck.q=XXXXX&group.limit=5&hl.simple.pre=<b>&hl.fl=content&spellcheck.collate=true&wt=javabin&hl=true&rows=20&version=2&fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,category&hl.snippets=3&start=0&q=%2BXXXX+-"XXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXXX"+-XXX+-"XXXXX"+-XXXX+-XXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXX+-"XXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXXX"+-"XXXXXX"+-XXXXXX+-XXXXX+-"XXXXX"+"XXXXX"+"XXXXX"+"XXXXXX"++&group.field=host&hl.simple.post=</b>&group=true&qt=search&fq=mrank:[0+TO+100]&fq=word_count:[70+TO+*]
>>> >>>> ******
>>> >>>>
>>> >>>> XXXX stands for the words. Every phrase "XXXXX" contains two words.
>>> >>>>
>>> >>>> The timing from debugQuery:
>>> >>>>
>>> >>>> <lst name="timing">
>>> >>>> <double name="time">8654.0</double>
>>> >>>> <lst name="prepare">
>>> >>>> <double name="time">16.0</double>
>>> >>>> <lst name="org.apache.solr.handler.component.QueryComponent">
>>> >>>> <double name="time">16.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.FacetComponent">
>>> >>>> <double name="time">0.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>> >>>> <double name="time">0.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.HighlightComponent">
>>> >>>> <double name="time">0.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.StatsComponent">
>>> >>>> <double name="time">0.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>>> >>>> <double name="time">0.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.DebugComponent">
>>> >>>> <double name="time">0.0</double>
>>> >>>> </lst>
>>> >>>> </lst>
>>> >>>> <lst name="process">
>>> >>>> <double name="time">8638.0</double>
>>> >>>> <lst name="org.apache.solr.handler.component.QueryComponent">
>>> >>>> <double name="time">4473.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.FacetComponent">
>>> >>>> <double name="time">0.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>> >>>> <double name="time">0.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.HighlightComponent">
>>> >>>> <double name="time">42.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.StatsComponent">
>>> >>>> <double name="time">0.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>>> >>>> <double name="time">1.0</double>
>>> >>>> </lst>
>>> >>>> <lst name="org.apache.solr.handler.component.DebugComponent">
>>> >>>> <double name="time">4122.0</double>
>>> >>>> </lst>
>>> >>>>
>>> >>>>
>>> >>>> The funny thing is that if I remove the ShingleFilter from the
>>> >>>> "sh_text" field below and index normally, the query time is half that
>>> >>>> of the current shingle one! Shouldn't a shingled index be better for
>>> >>>> such heavy two-word phrase searches? I am confused.
>>> >>>>
>>> >>>> On the other hand, an off-the-shelf search engine from one of the big
>>> >>>> FAT companies runs the same query on the same machine in 0.7 to 0.8
>>> >>>> secs without cache. I am confident we can do better in Solr, but I
>>> >>>> couldn't find the way at the moment.
>>> >>>>
>>> >>>> Thanks for helping.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Sat, Aug 27, 2011 at 2:46 AM, Erik Hatcher <erik.hatc...@gmail.com>
>>> >>>> wrote:
>>> >>>>
>>> >>>>>
>>> >>>>> On Aug 26, 2011, at 17:49 , Lord Khan Han wrote:
>>> >>>>>> We are indexing news documents from various sites. Currently we
>>> >>>>>> have 200K docs indexed. Total index size is 36 gig. There are also
>>> >>>>>> attachments to the news (PDFs, docs, etc.), so document size can be
>>> >>>>>> high (e.g. 10 MB).
>>> >>>>>>
>>> >>>>>> We are using some complex queries that include around 30-40 terms
>>> >>>>>> per query. 70% of these terms are two-word phrases. We combine them
>>> >>>>>> with + and - to pinpoint exact results.
>>> >>>>>> There is also grouping, dismax, boosting, and term vector
>>> >>>>>> highlighting.
>>> >>>>>
>>> >>>>> You're using a lot of componentry there, and have complex queries.
>>> >>>>> We need more details.
>>> >>>>>
>>> >>>>> Turn on debugQuery=true... what do the timings say for each
>>> >>>>> component?
>>> >>>>>
>>> >>>>>> Our problem is query times. Currently they are around 6-7 secs. I
>>> >>>>>> know our query is a little bit heavy, but we want to improve query
>>> >>>>>> performance. I believe we can make it sub-second, but no success at
>>> >>>>>> the moment.
>>> >>>>>
>>> >>>>> Please provide an example query or two (perhaps a full line logged
>>> >>>>> from Solr itself), and then let's see what debugQuery says about your
>>> >>>>> query being parsed.
>>> >>>>>
>>> >>>>>> We tried using two-word shingle tokens, but it decreases query
>>> >>>>>> performance!! We assumed it would help speed up phrase searches.
>>> >>>>>
>>> >>>>> Again, we'd need to see a parsed query to understand this deeper.
>>> >>>>>
>>> >>>>> Lots of synonym expansion?  A parsed query will tell us.
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>> (Using the latest Solr trunk, and the HW is pretty good: 32 cores
>>> >>>>>> with 32 gig RAM)
>>> >>>>>>
>>> >>>>>> Here is the field def:
>>> >>>>>>
>>> >>>>>> <fieldType name="sh_text" class="solr.TextField"
>>> >>>>>> positionIncrementGap="100"
>>> >>>>>> autoGeneratePhraseQueries="true">
>>> >>>>>>    <analyzer type="index">
>>> >>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> >>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>> >>>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>> >>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>> >>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>> >>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>> >>>>>>      <!--<filter class="solr.LowerCaseFilterFactory"/>-->
>>> >>>>>>      <filter class="solr.KeywordMarkerFilterFactory"
>>> >>>>>> protected="protwords.txt"/>
>>> >>>>>>      <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>>> >>>>>> outputUnigrams="true"/>
>>> >>>>>>    </analyzer>
>>> >>>>>>    <analyzer type="query">
>>> >>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> >>>>>>      <filter class="solr.SynonymFilterFactory"
>>> >>>>>> synonyms="synonyms.txt"
>>> >>>>>> ignoreCase="true" expand="true"/>
>>> >>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>> >>>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>> >>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>> >>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>> >>>>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>> >>>>>>      <!--<filter class="solr.LowerCaseFilterFactory"/>-->
>>> >>>>>>      <filter class="solr.KeywordMarkerFilterFactory"
>>> >>>>>> protected="protwords.txt"/>
>>> >>>>>>      <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>>> >>>>>> outputUnigrams="true"/>
>>> >>>>>>    </analyzer>
>>> >>>>>>  </fieldType>
>>> >>>>>>
>>> >>>>>> and
>>> >>>>>>
>>> >>>>>> <field name="content" type="sh_text" stored="true" indexed="true"
>>> >>>>>> termVectors="true" termPositions="true" termOffsets="true"/>
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>
>>> >>
>>>
>>>
>>
>
