Re: Shingle and Query Performance

Lord Khan Han Sun, 28 Aug 2011 02:31:15 -0700

Another interesting thing: all queries of one or more words, including
phrase queries such as "barack obama", are slower in the shingle
configuration. What am I doing wrong? Without shingles, "barack obama" has a
query time of 300 ms; with shingles, 780 ms.
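
For readers following along, here is a rough Python sketch (not Solr code) of what a shingle filter configured with maxShingleSize="2" and outputUnigrams="true" emits for a short token stream. The function name `shingle` and its parameters are made up for illustration, and the model ignores token positions, stop-word gaps, and the other filters in the chain. It only shows that a two-word phrase becomes three tokens when unigrams are kept; whether that accounts for the slowdown would still need the parsed query from debugQuery.

```python
from typing import List


def shingle(tokens: List[str], max_size: int = 2,
            output_unigrams: bool = True) -> List[str]:
    """Simplified model of a shingle filter: emit each token (optionally)
    followed by the word n-grams that start at that token."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                # Shingles are modelled here as words joined by a single space.
                out.append(" ".join(tokens[i:i + n]))
    return out


if __name__ == "__main__":
    # Unigrams kept, as in the sh_text field definition quoted in this thread.
    print(shingle(["barack", "obama"]))
    # Unigrams dropped, for comparison.
    print(shingle(["barack", "obama"], output_unigrams=False))
```

With unigrams dropped, the same phrase yields a single token, which is the case where a shingled field would be expected to turn a phrase match into a single term lookup.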



On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han <khanuniver...@gmail.com> wrote:

> Hi,
>
> What is the difference between Solr 3.3 and the trunk?
> I will try 3.3 and let you know the results.
>
>
> Here is the search handler:
>
> <requestHandler name="search" class="solr.SearchHandler" default="true">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <int name="rows">10</int>
>     <!--<str name="fq">category:vv</str>-->
>     <str name="fq">mrank:[0 TO 100]</str>
>     <str name="echoParams">explicit</str>
>     <int name="rows">10</int>
>     <str name="defType">edismax</str>
>     <!--<str name="qf">title^0.05 url^1.2 content^1.7 m_title^10.0</str>-->
>     <str name="qf">title^1.05 url^1.2 content^1.7 m_title^10.0</str>
>     <!-- <str name="bf">recip(ee_score,-0.85,1,0.2)</str> -->
>     <str name="pf">content^18.0 m_title^5.0</str>
>     <int name="ps">1</int>
>     <int name="qs">0</int>
>     <str name="mm">2&lt;-25%</str>
>     <str name="spellcheck">true</str>
>     <!--<str name="spellcheck.collate">true</str>   -->
>     <str name="spellcheck.count">5</str>
>     <str name="spellcheck.dictionary">subobjective</str>
>     <str name="spellcheck.onlyMorePopular">false</str>
>     <str name="hl.tag.pre">&lt;b&gt;</str>
>     <str name="hl.tag.post">&lt;/b&gt;</str>
>     <str name="hl.useFastVectorHighlighter">true</str>
>   </lst>
>
>
>
>
> On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>
>> I'm not sure what the issue could be at this point.   I see you've got
>> qt=search - what's the definition of that request handler?
>>
>> What is the parsed query (from the debugQuery response)?
>>
>> Have you tried this with Solr 3.3 to see if there's any appreciable
>> difference?
>>
>>        Erik
>>
>> On Aug 27, 2011, at 09:34 , Lord Khan Han wrote:
>>
>> > With grouping off, the query time goes from 3567 ms to 1912 ms. Grouping
>> > increases the query time and makes the cache useless. But the same config
>> > is still faster without shingles.
>> >
>> > We have a head-to-head test this Wednesday against this commercial search
>> > engine, so I am looking for any suggestions.
>> >
>> >
>> >
>> > On Sat, Aug 27, 2011 at 3:37 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>> >
>> >> Please confirm whether this is caused by grouping. Turn grouping off;
>> >> what's the query time like?
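
One possible way to run this comparison is to issue the same query twice, once with group=true and once with group=false, and read QTime and the timing section from each response. The Python sketch below only builds the two request URLs. The base URL, host, and port are assumptions and the helper name `build_query_url` is made up; the parameter names (q, qt, rows, group, group.field, group.limit, debugQuery) are the ones that appear elsewhere in this thread.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance; replace with the real select URL.
BASE_URL = "http://localhost:8983/solr/select"


def build_query_url(q: str, group: bool, base_url: str = BASE_URL) -> str:
    """Build a select URL for the 'search' handler with debug timings on,
    toggling only the grouping parameters."""
    params = [
        ("q", q),
        ("qt", "search"),
        ("rows", "20"),
        ("debugQuery", "true"),
        ("group", "true" if group else "false"),
    ]
    if group:
        # Grouping settings copied from the logged request in this thread.
        params += [("group.field", "host"), ("group.limit", "5")]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    for grouped in (True, False):
        print(build_query_url('"barack obama"', group=grouped))
```

Keeping every other parameter identical between the two requests makes the difference in query time attributable to grouping alone.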
>> >>
>> >>
>> >> On Aug 27, 2011, at 07:27 , Lord Khan Han wrote:
>> >>
>> >>> On the other hand, we couldn't use the cache for queries of the type
>> >>> below. I think that's caused by grouping. Anyway, we need to be
>> >>> sub-second without the cache.
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han <khanuniver...@gmail.com> wrote:
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> Thanks for the reply.
>> >>>>
>> >>>> Here is the Solr log capture:
>> >>>>
>> >>>> ******
>> >>>>
>> >>>>
>> >>
>> hl.fragsize=100&spellcheck=true&spellcheck.q=XXXXX&group.limit=5&hl.simple.pre=<b>&hl.fl=content&spellcheck.collate=true&wt=javabin&hl=true&rows=20&version=2&fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,category&hl.snippets=3&start=0&q=%2BXXXX+-"XXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXXX"+-XXX+-"XXXXX"+-XXXX+-XXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXX+-"XXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXXX"+-"XXXXXX"+-XXXXXX+-XXXXX+-"XXXXX"+"XXXXX"+"XXXXX"+"XXXXXX"++&group.field=host&hl.simple.post=</b>&group=true&qt=search&fq=mrank:[0+TO+100]&fq=word_count:[70+TO+*]
>> >>>> ******
>> >>>>
>> >>>> XXXX stands for the words. Each quoted phrase "XXXXX" contains two words.
>> >>>>
>> >>>> The timing from the DebugQuery:
>> >>>>
>> >>>> <lst name="timing">
>> >>>> <double name="time">8654.0</double>
>> >>>> <lst name="prepare">
>> >>>> <double name="time">16.0</double>
>> >>>> <lst name="org.apache.solr.handler.component.QueryComponent">
>> >>>> <double name="time">16.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.FacetComponent">
>> >>>> <double name="time">0.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>> >>>> <double name="time">0.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.HighlightComponent">
>> >>>> <double name="time">0.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.StatsComponent">
>> >>>> <double name="time">0.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>> >>>> <double name="time">0.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.DebugComponent">
>> >>>> <double name="time">0.0</double>
>> >>>> </lst>
>> >>>> </lst>
>> >>>> <lst name="process">
>> >>>> <double name="time">8638.0</double>
>> >>>> <lst name="org.apache.solr.handler.component.QueryComponent">
>> >>>> <double name="time">4473.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.FacetComponent">
>> >>>> <double name="time">0.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>> >>>> <double name="time">0.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.HighlightComponent">
>> >>>> <double name="time">42.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.StatsComponent">
>> >>>> <double name="time">0.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>> >>>> <double name="time">1.0</double>
>> >>>> </lst>
>> >>>> <lst name="org.apache.solr.handler.component.DebugComponent">
>> >>>> <double name="time">4122.0</double>
>> >>>> </lst>
>> >>>>
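
To make the timing output above easier to read, here is a small Python sketch that ranks the components of the "process" phase by time. The XML string is an abbreviated, well-formed copy of the figures quoted above (the zero-time components are omitted and the closing tags are added); the function name `component_times` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the debugQuery timing section quoted in this thread.
TIMING_XML = """
<lst name="timing">
  <double name="time">8654.0</double>
  <lst name="process">
    <double name="time">8638.0</double>
    <lst name="org.apache.solr.handler.component.QueryComponent">
      <double name="time">4473.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.HighlightComponent">
      <double name="time">42.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.SpellCheckComponent">
      <double name="time">1.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.DebugComponent">
      <double name="time">4122.0</double>
    </lst>
  </lst>
</lst>
"""


def component_times(xml_text: str, phase: str = "process") -> dict:
    """Return {short component name: milliseconds} for one phase,
    sorted from slowest to fastest."""
    root = ET.fromstring(xml_text)
    phase_node = root.find(f"./lst[@name='{phase}']")
    times = {}
    for comp in phase_node.findall("lst"):
        short_name = comp.get("name").rsplit(".", 1)[-1]
        times[short_name] = float(comp.find("double[@name='time']").text)
    return dict(sorted(times.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    for name, ms in component_times(TIMING_XML).items():
        print(f"{name:22s} {ms:8.1f} ms")
```

Read this way, QueryComponent (4473 ms) and DebugComponent (4122 ms) together account for nearly all of the 8638 ms process time. The DebugComponent share is presumably the cost of producing the debug output itself, so it should not be counted against normal query time.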
>> >>>>
>> >>>> The funny thing is that if I remove the ShingleFilter from the "sh_text"
>> >>>> field below and index normally, the query time is half that of the
>> >>>> current shingled one! Shouldn't a shingled index be better for such heavy
>> >>>> two-word phrase searches? I am confused.
>> >>>>
>> >>>> On the other hand, an off-the-shelf search engine from one of the big
>> >>>> companies runs the same query on the same machine in 0.7 to 0.8 seconds
>> >>>> without a cache. I am confident we can do better in Solr, but I haven't
>> >>>> found the way yet.
>> >>>>
>> >>>> Thanks for helping.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Sat, Aug 27, 2011 at 2:46 AM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>> >>>>
>> >>>>>
>> >>>>> On Aug 26, 2011, at 17:49 , Lord Khan Han wrote:
>> >>>>>> We are indexing news documents from various sites. Currently we have
>> >>>>>> 200K docs indexed. Total index size is 36 GB. The news items also have
>> >>>>>> attachments (PDFs, docs, etc.), so document size can be large (e.g. 10 MB).
>> >>>>>>
>> >>>>>> We are using some complex queries that include around 30 to 40 terms per
>> >>>>>> query. 70% of these terms are two-word phrases. We combine them with
>> >>>>>> + and - to pinpoint exact results.
>> >>>>>> There is also grouping, dismax, boosting, and term vector highlighting.
>> >>>>>
>> >>>>> You're using a lot of componentry there, and have complex queries. We
>> >>>>> need more details.
>> >>>>>
>> >>>>> Turn on debugQuery=true... what do the timings say for each component?
>> >>>>>
>> >>>>>> Our problem is query time. Currently it's around 6-7 seconds. I know our
>> >>>>>> query is a bit heavy, but we want to improve query performance. I believe
>> >>>>>> we can make it sub-second, but no success so far.
>> >>>>>
>> >>>>> Please provide an example query or two (perhaps a full line logged from
>> >>>>> Solr itself), and then let's see what debugQuery says about your query
>> >>>>> being parsed.
>> >>>>>
>> >>>>>> We tried using two-word shingle tokens, but that decreases query
>> >>>>>> performance! We assumed it would help speed up phrase searches.
>> >>>>>
>> >>>>> Again, we'd need to see a parsed query to understand this deeper.
>> >>>>>
>> >>>>> Lots of synonym expansion?  A parsed query will tell us.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>> (Using the latest Solr trunk; the hardware is pretty good: 32 cores with
>> >>>>>> 32 GB RAM.)
>> >>>>>>
>> >>>>>> Here the field def:
>> >>>>>>
>> >>>>>> <fieldType name="sh_text" class="solr.TextField"
>> >>>>>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
>> >>>>>>   <analyzer type="index">
>> >>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >>>>>>             words="stopwords.txt" enablePositionIncrements="true"/>
>> >>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>> >>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> >>>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>> >>>>>>     <!--<filter class="solr.LowerCaseFilterFactory"/>-->
>> >>>>>>     <filter class="solr.KeywordMarkerFilterFactory"
>> >>>>>>             protected="protwords.txt"/>
>> >>>>>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>> >>>>>>             outputUnigrams="true"/>
>> >>>>>>   </analyzer>
>> >>>>>>   <analyzer type="query">
>> >>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> >>>>>>             ignoreCase="true" expand="true"/>
>> >>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >>>>>>             words="stopwords.txt" enablePositionIncrements="true"/>
>> >>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>> >>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
>> >>>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>> >>>>>>     <!--<filter class="solr.LowerCaseFilterFactory"/>-->
>> >>>>>>     <filter class="solr.KeywordMarkerFilterFactory"
>> >>>>>>             protected="protwords.txt"/>
>> >>>>>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>> >>>>>>             outputUnigrams="true"/>
>> >>>>>>   </analyzer>
>> >>>>>> </fieldType>
>> >>>>>>
>> >>>>>> and
>> >>>>>>
>> >>>>>> <field name="content" type="sh_text" stored="true" indexed="true"
>> >>>>>>        termVectors="true" termPositions="true" termOffsets="true"/>
>> >>>>>
>> >>>>>
>> >>>>
>> >>
>> >>
>>
>>
>
