Re: slow highlighting because of stemming

Mike Sokolov Fri, 29 Jul 2011 14:41:48 -0700

I'm not sure I would identify stemming as the culprit here.

Do you have very large documents? If so, there is a patch for FVH committed to limit the number of phrases it looks at; see hl.phraseLimit, but this won't be available until 3.4 is released.
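
For illustration, once a release containing that patch is in use, the limit could be set as a request default along these lines. This is only a sketch: the value 1000 is an arbitrary placeholder, and I am assuming the block sits in the defaults of whichever request handler serves your queries.

    <lst name="defaults">
      <!-- assumption: 1000 is an arbitrary example value, tune it for your documents -->
      <int name="hl.phraseLimit">1000</int>
    </lst>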

You can also limit the amount of each document that is analyzed by the regular Highlighter using maxDocCharsToAnalyze (and maybe this applies to FVH? not sure).
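
In Solr, as far as I know, that setting is exposed as the hl.maxAnalyzedChars request parameter. A sketch, with 10000 as an arbitrary example value and again assuming it goes in the request handler defaults:

    <lst name="defaults">
      <!-- assumption: 10000 is an arbitrary example value -->
      <int name="hl.maxAnalyzedChars">10000</int>
    </lst>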

Using RegexFragmenter is also probably slower than something like SimpleFragmenter.
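
If you want to try that, Solr's counterpart to SimpleFragmenter is, as far as I know, GapFragmenter. A minimal sketch of declaring it as the default, reusing your fragsize of 500 (whether that size suits your data is an assumption on my part):

    <highlighting>
      <fragmenter name="gap" default="true"
                  class="org.apache.solr.highlight.GapFragmenter">
        <lst name="defaults">
          <int name="hl.fragsize">500</int>
        </lst>
      </fragmenter>
    </highlighting>

It can also be selected per request with hl.fragmenter=gap.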

There is work to implement faster highlighting for Solr/Lucene, but it depends on some basic changes to the search architecture so it might be a while before that becomes available. See https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested in following that development.


-Mike

On 07/29/2011 04:55 AM, Orosz György wrote:

Dear all,

I am quite new to Solr and would like to ask for your help.
I am developing an application which should be able to highlight the results
of a query. For this I am using the regex fragmenter:
<highlighting>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">500</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.pre"><![CDATA[<b>]]></str>
      <str name="hl.post"><![CDATA[</b>]]></str>
      <str name="hl.useFastVectorHighlighter">true</str>
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,300}[.?!]</str>
      <str name="hl.fl">dokumentum_syn_query</str>
    </lst>
  </fragmenter>
</highlighting>
The field is indexed with term vectors and offsets:
<field name="dokumentum_syn_query" type="huntext_syn" indexed="true"
       stored="true" multiValued="true" termVectors="on" termPositions="on"
       termOffsets="on"/>
<fieldType name="huntext_syn" class="solr.TextField" stored="true"
           indexed="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.morphologic.solr.huntoken.HunTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The highlighting works well, except that it's really slow. I realized that
this is because the highlighter/fragmenter does stemming for all the result
documents again.

Could you please help me understand why this happens and how I can avoid it? (I
thought that using FastVectorHighlighter would solve my problem, but it
didn't.)

Thanks in advance!
Gyuri Orosz
