Re: Shingle and Query Performance

Lord Khan Han Sat, 27 Aug 2011 09:58:58 -0700

Hi,

What is the difference between Solr 3.3 and trunk?
I will try 3.3 and let you know the results.



Here is the search handler:

<requestHandler name="search" class="solr.SearchHandler" default="true">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <!--<str name="fq">category:vv</str>-->
       <str name="fq">mrank:[0 TO 100]</str>
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="defType">edismax</str>
       <!--<str name="qf">title^0.05 url^1.2 content^1.7 m_title^10.0</str>-->
       <str name="qf">title^1.05 url^1.2 content^1.7 m_title^10.0</str>
       <!-- <str name="bf">recip(ee_score,-0.85,1,0.2)</str> -->
       <str name="pf">content^18.0 m_title^5.0</str>
       <int name="ps">1</int>
       <int name="qs">0</int>
       <str name="mm">2&lt;-25%</str>
       <str name="spellcheck">true</str>
       <!--<str name="spellcheck.collate">true</str>-->
       <str name="spellcheck.count">5</str>
       <str name="spellcheck.dictionary">subobjective</str>
       <str name="spellcheck.onlyMorePopular">false</str>
       <str name="hl.tag.pre">&lt;b&gt;</str>
       <str name="hl.tag.post">&lt;/b&gt;</str>
       <str name="hl.useFastVectorHighlighter">true</str>
     </lst>
</requestHandler>
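
Both of Erik's questions quoted below (the parsed query and the per-component timings) can be read from the debugQuery response. Here is a minimal Python sketch, assuming the usual XML response layout with a "debug" list. The sample is trimmed: its timing values are copied from the output quoted in this thread, while its parsedquery string is a made-up placeholder, not the real query. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed sample response. Timing values come from the thread;
# the parsedquery text is a placeholder, NOT the real parsed query.
SAMPLE = """<response>
  <lst name="debug">
    <str name="parsedquery">+content:foo -content:"bar baz"</str>
    <lst name="timing">
      <double name="time">8654.0</double>
      <lst name="prepare">
        <double name="time">16.0</double>
      </lst>
      <lst name="process">
        <double name="time">8638.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">4473.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent">
          <double name="time">42.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.DebugComponent">
          <double name="time">4122.0</double>
        </lst>
      </lst>
    </lst>
  </lst>
</response>"""


def parsed_query(xml_text):
    """Return the parsedquery string from a debugQuery response, or None."""
    root = ET.fromstring(xml_text)
    node = root.find("./lst[@name='debug']/str[@name='parsedquery']")
    return node.text if node is not None else None


def process_timings(xml_text):
    """Map short component name -> milliseconds spent in the process phase."""
    root = ET.fromstring(xml_text)
    process = root.find(
        "./lst[@name='debug']/lst[@name='timing']/lst[@name='process']"
    )
    timings = {}
    if process is None:
        return timings
    for comp in process.findall("lst"):
        time_node = comp.find("double[@name='time']")
        short_name = comp.get("name").rsplit(".", 1)[-1]
        timings[short_name] = float(time_node.text)
    return timings


if __name__ == "__main__":
    print(parsed_query(SAMPLE))
    for name, ms in sorted(process_timings(SAMPLE).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {ms} ms")
```

In the timing output quoted below, DebugComponent accounts for 4122 ms of the 8638 ms process time. That part presumably disappears when debugQuery is off, which leaves QueryComponent (4473 ms) as the cost to focus on.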




On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher <[email protected]>wrote:

> I'm not sure what the issue could be at this point.   I see you've got
> qt=search - what's the definition of that request handler?
>
> What is the parsed query (from the debugQuery response)?
>
> Have you tried this with Solr 3.3 to see if there's any appreciable
> difference?
>
>        Erik
>
> On Aug 27, 2011, at 09:34 , Lord Khan Han wrote:
>
> > With grouping off, the query time drops from 3567 ms to 1912 ms. Grouping
> > increases the query time and makes the cache useless. But the same config
> > is still faster without shingles.
> >
> > We have a head-to-head test this Wednesday with this commercial search
> > engine, so I am looking for any suggestions.
> >
> >
> >
> > On Sat, Aug 27, 2011 at 3:37 PM, Erik Hatcher <[email protected]
> >wrote:
> >
> >> Please confirm if this is caused by grouping.  Turn grouping off, what's
> >> query time like?
> >>
> >>
> >> On Aug 27, 2011, at 07:27 , Lord Khan Han wrote:
> >>
> >>> On the other hand, we couldn't use the cache for queries of the type
> >>> below. I think that is caused by grouping. Anyway, we need to be
> >>> sub-second without the cache.
> >>>
> >>>
> >>>
> >>> On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han <
> [email protected]
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Thanks for the reply.
> >>>>
> >>>> Here is the Solr log capture:
> >>>>
> >>>> ******
> >>>>
> >>>>
> >>
> hl.fragsize=100&spellcheck=true&spellcheck.q=XXXXX&group.limit=5&hl.simple.pre=<b>&hl.fl=content&spellcheck.collate=true&wt=javabin&hl=true&rows=20&version=2&fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,category&hl.snippets=3&start=0&q=%2BXXXX+-"XXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXXX"+-XXX+-"XXXXX"+-XXXX+-XXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXX+-"XXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXX"+-"XXXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXXX"+-XXXX+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXX"+-"XXXXX"+-"XXXXX"+-"XXXXX"+-XXXXX+-"XXXXXX"+-"XXXXXX"+-XXXXXX+-XXXXX+-"XXXXX"+"XXXXX"+"XXXXX"+"XXXXXX"++&group.field=host&hl.simple.post=</b>&group=true&qt=search&fq=mrank:[0+TO+100]&fq=word_count:[70+TO+*]
> >>>> ******
> >>>>
> >>>> XXXX stands for the words. Every phrase "XXXXX" contains two words.
> >>>>
> >>>> The timing from debugQuery:
> >>>>
> >>>> <lst name="timing">
> >>>> <double name="time">8654.0</double>
> >>>> <lst name="prepare">
> >>>> <double name="time">16.0</double>
> >>>> <lst name="org.apache.solr.handler.component.QueryComponent">
> >>>> <double name="time">16.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.FacetComponent">
> >>>> <double name="time">0.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
> >>>> <double name="time">0.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.HighlightComponent">
> >>>> <double name="time">0.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.StatsComponent">
> >>>> <double name="time">0.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
> >>>> <double name="time">0.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.DebugComponent">
> >>>> <double name="time">0.0</double>
> >>>> </lst>
> >>>> </lst>
> >>>> <lst name="process">
> >>>> <double name="time">8638.0</double>
> >>>> <lst name="org.apache.solr.handler.component.QueryComponent">
> >>>> <double name="time">4473.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.FacetComponent">
> >>>> <double name="time">0.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
> >>>> <double name="time">0.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.HighlightComponent">
> >>>> <double name="time">42.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.StatsComponent">
> >>>> <double name="time">0.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
> >>>> <double name="time">1.0</double>
> >>>> </lst>
> >>>> <lst name="org.apache.solr.handler.component.DebugComponent">
> >>>> <double name="time">4122.0</double>
> >>>> </lst>
> >>>>
> >>>>
> >>>> The funny thing is that if I remove the ShingleFilter from the "sh_text"
> >>>> field below and index normally, the query time is half that of the
> >>>> current shingled one! Shouldn't a shingled index be better for such
> >>>> heavy two-word phrase searches? I am confused.
> >>>>
> >>>> On the other hand, an off-the-shelf search engine from one of the big
> >>>> FAT companies runs the same query on the same machine in 0.7-0.8 secs
> >>>> without cache. I am confident we can do better in Solr but couldn't
> >>>> find the way at the moment.
> >>>>
> >>>> Thanks for helping.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Aug 27, 2011 at 2:46 AM, Erik Hatcher <[email protected]
> >>> wrote:
> >>>>
> >>>>>
> >>>>> On Aug 26, 2011, at 17:49 , Lord Khan Han wrote:
> >>>>>> We are indexing news documents from various sites. Currently we have
> >>>>>> 200K docs indexed. Total index size is 36 GB. There are also
> >>>>>> attachments to the news (PDFs, docs, etc.), so document size can be
> >>>>>> high (e.g. 10 MB).
> >>>>>>
> >>>>>> We are using some complex queries that include around 30-40 terms
> >>>>>> per query. 70% of these terms are two-word phrases. We are using
> >>>>>> them with the + and - operators to pinpoint exact results.
> >>>>>> There is also grouping, dismax, boosting and term-vector highlighting.
> >>>>>
> >>>>> You're using a lot of componentry there, and have complex queries. We
> >>>>> need more details.
> >>>>>
> >>>>> Turn on debugQuery=true... what do the timings say for each component?
> >>>>>
> >>>>>> Our problem is query times. Currently it's around 6-7 secs. I know our
> >>>>>> query is a little bit heavy, but we want to improve query performance.
> >>>>>> I believe we can make it sub-second, but no success at the moment.
> >>>>>
> >>>>> Please provide an example query or two (perhaps a full line logged from
> >>>>> Solr itself), and then let's see what debugQuery says about your query
> >>>>> being parsed.
> >>>>>
> >>>>>> We tried using two-word shingle tokens, but it decreases the query
> >>>>>> performance! We assumed it would help speed up phrase searches.
> >>>>>
> >>>>> Again, we'd need to see a parsed query to understand this deeper.
> >>>>>
> >>>>> Lots of synonym expansion?  A parsed query will tell us.
> >>>>>
> >>>>>
> >>>>>
> >>>>>> (Using the latest Solr trunk; the hardware is pretty good: 32 cores
> >>>>>> with 32 GB RAM.)
> >>>>>>
> >>>>>> Here is the field def:
> >>>>>>
> >>>>>> <fieldType name="sh_text" class="solr.TextField"
> >>>>>> positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>>>>>    <analyzer type="index">
> >>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>>> words="stopwords.txt" enablePositionIncrements="true" />
> >>>>>>      <filter class="solr.WordDelimiterFilterFactory"
> >>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>      <!--<filter class="solr.LowerCaseFilterFactory"/>-->
> >>>>>>      <filter class="solr.KeywordMarkerFilterFactory"
> >>>>>> protected="protwords.txt"/>
> >>>>>>      <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> >>>>>> outputUnigrams="true"/>
> >>>>>>    </analyzer>
> >>>>>>    <analyzer type="query">
> >>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>>>> ignoreCase="true" expand="true"/>
> >>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>>> words="stopwords.txt" enablePositionIncrements="true" />
> >>>>>>      <filter class="solr.WordDelimiterFilterFactory"
> >>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>>>>      <!--<filter class="solr.LowerCaseFilterFactory"/>-->
> >>>>>>      <filter class="solr.KeywordMarkerFilterFactory"
> >>>>>> protected="protwords.txt"/>
> >>>>>>      <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> >>>>>> outputUnigrams="true"/>
> >>>>>>    </analyzer>
> >>>>>>  </fieldType>
> >>>>>>
> >>>>>> and
> >>>>>>
> >>>>>> <field name="content" type="sh_text" stored="true" indexed="true"
> >>>>>> termVectors="true" termPositions="true" termOffsets="true"/>
> >>>>>
> >>>>>
> >>>>
> >>
> >>
>
>
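
To illustrate the sh_text analyzer quoted above: with maxShingleSize="2" and outputUnigrams="true", each token is emitted on its own and again joined with its successor. The Python sketch below only imitates that token output. It is not Lucene's implementation, and the function name and sample words are hypothetical.

```python
def shingle(tokens, max_shingle_size=2, output_unigrams=True, sep=" "):
    """Imitate shingle output: unigrams (optional) plus n-grams up to max size."""
    out = []
    for i in range(len(tokens)):
        for size in range(1, max_shingle_size + 1):
            if size == 1 and not output_unigrams:
                continue  # skip single tokens when unigrams are disabled
            if i + size <= len(tokens):
                out.append(sep.join(tokens[i:i + size]))
    return out


if __name__ == "__main__":
    # A two-word phrase, as used for most terms in the queries in this thread.
    print(shingle(["breaking", "news"]))
    print(shingle(["breaking", "news"], output_unigrams=False))
```

Under these assumptions a two-word phrase becomes three tokens with unigrams on, and a single bigram token with unigrams off. Whether this expansion explains the slower queries is not confirmed anywhere in the thread; the parsed query from debugQuery would show what the phrases actually turn into.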
