HighLithing exact phrases with solr

Antonio Calò Mon, 05 Oct 2009 09:49:34 -0700

Hi Guys

I'm getting crazy with the highlighting in solr. The problem is the follow:
when I submit an exact phrase query, I get the related results and the
related snippets with highlight. But I've noticed that the *single term of
the phrase are highlighted too*. Here an example:


If I start a search for "quick brown fox", I obtain the correct result with
the doc wich contains the phrase, but the snippets came to me like this:

<lst name="highlighting">
     <lst name="14">
        <arr name="DocumentText">
            <str>
The <em>quick brown fox</em> jump over the lazy dog. The <em>fox</em> is a
nice animal.
            </str>
     </arr>
  </lst>
</lst>


Also with some documents, only single terms are highlighted insteand of
exact sentence even if the exact phrase is contained into the document i.
e.:
<lst name="highlighting">
     <lst name="14">
        <arr name="DocumentText">
            <str>
The <em>fox</em> is a nice animal.
            </str>
     </arr>
  </lst>
</lst>


My understanding of highlighting is that if I search for exact phrase, only
the exact phrase is should be highlighted.

Here an extract of my solrconfig.xml & schema.xml

solrconfig.xml:

<highlighting>
   <!-- Configure the standard fragmenter -->
   <!-- This could most likely be commented out in the "default" case -->
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter">
    <lst name="defaults">
     <int name="hl.fragsize">500</int>
    </lst>
   </fragmenter>

   <!-- A regular-expression-based fragmenter (f.i., for sentence
extraction) -->
   <fragmenter name="regex"
class="org.apache.solr.highlight.RegexFragmenter" default="true">
    <lst name="defaults">
      <!-- slightly smaller fragsizes work better because of slop -->
      <int name="hl.fragsize">700</int>
      <!-- allow 50% slop on fragment sizes -->
      <float name="hl.regex.slop">0.5</float>
      <!-- a basic sentence pattern -->
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>

      <bool name="hl.usePhraseHighlighter">true</bool>

      <bool name="hl.highlightMultiTerm">true</bool>
    </lst>
   </fragmenter>

   <!-- Configure the standard formatter -->
   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter">
    <lst name="highlighting">
     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
    </lst>
   </formatter>


schema.xml:

<analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                 <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stop_italiano.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>


            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stop_italiano.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldtype>


Maybe I'm missing something, or my understanding of the highlighting feature
is not correct. Any Idea?

As always, thanks for your support!

Regards, Antonio

HighLithing exact phrases with solr

Reply via email to