Re: FastVectorHighlighter -> no highlights

Schmidt Jeff Fri, 27 Apr 2012 14:33:17 -0700

Okay, my fault. I misunderstood the conditions under which DataStax 
Enterprise 2 re-indexes content. The field definitions were set properly to 
support FVH, but I believe no position and offset data had actually been 
generated, which would explain the empty highlights.
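
One way to confirm whether positions and offsets actually made it into the 
index is to ask Solr's TermVectorComponent for a single document. The sketch 
below only builds such a request URL in Python. The base URL, the /tvrh 
handler path (as registered in the stock example solrconfig.xml), and the 
"id" unique-key field name are assumptions, not details confirmed in this 
thread, and whether DSE exposes this handler is not verified here.

```python
from urllib.parse import urlencode


def term_vector_check_url(base_url, doc_id, field, id_field="id"):
    """Build a TermVectorComponent request URL for one document and field.

    Assumptions (hypothetical, adjust to your setup): the handler is
    registered at /tvrh and the unique key field is named `id_field`.
    """
    params = {
        "q": f'{id_field}:"{doc_id}"',  # select exactly one document
        "fl": id_field,                 # keep the stored-field output small
        "tv": "true",                   # turn the term vector component on
        "tv.fl": field,                 # field whose term vectors to return
        "tv.positions": "true",         # include term positions
        "tv.offsets": "true",           # include start/end character offsets
        "wt": "xml",
    }
    return f"{base_url.rstrip('/')}/tvrh?{urlencode(params)}"


# Document ID and field name taken from the example in this message.
url = term_vector_check_url("http://localhost:8983/solr", "ING:6xwoe", "n_name")
print(url)
```

If the response for a document lists terms without position or offset 
entries, that document was most likely indexed before the term vector 
options were enabled on the field.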


After re-indexing my content, I am now getting highlights from FVH, and I'm 
getting them noticeably faster.  But, for some reason, the document ID now 
appears in the snippet and its prefix is highlighted instead of the matched 
term.  For example, for a given document using the old highlighter:

<lst name="ING:6xwoe">
    <arr name="n_name">
        <str><span class="ingReasonText">egfr</span></str>
    </arr>
    <arr name="n_synonym">
        <str><span class="ingReasonText">egfr</span></str>
    </arr>
</lst>

With FVH, I get:

<lst name="ING:6xwoe">
    <arr name="n_name">
        <str><span class="ingReasonText">ING:</span>6xwoe egfr </str>
    </arr>
    <arr name="n_synonym">
        <str><span class="ingReasonText">ING:</span>6xwoe egfr </str>
    </arr>
</lst>

Anybody ever seen that before?

Thanks,

Jeff

On Apr 23, 2012, at 1:26 PM, Jeffrey Schmidt wrote:

> This does not appear to be shingle specific.  A non-shingled field also 
> fails to highlight in the same way with FVH.  I can see in the timing 
> information that FVH takes much longer to run than no highlighting at all, 
> so Solr must be doing something.  But why it just lists the document IDs 
> with few or no field highlights is still a mystery.
> 
> Any ideas on where I should look in the configuration, parameters to try etc.?
> 
> Cheers,
> 
> Jeff
> 
> On Apr 19, 2012, at 7:51 AM, Jeff Schmidt wrote:
> 
>> I am using Solr 4.0, and debug=timing shows Solr spending the great majority 
>> of its time in the HighlightComponent. It seemed logical to look into the 
>> FastVectorHighlighter.  It does seem much faster, but on the other hand, I'm 
>> not getting the highlights I need. :)
>> 
>> I've seen references to FVH not supporting MultiTerm and (non-fixed sized) 
>> ngrams.  I'm using edismax, and I don't know if a certain configuration of 
>> that becomes multi term and that's my problem, or if there is something 
>> completely different. I don't have ngrams, but I do shingle.  For the 
>> examples below, I have these fields defined:
>> 
>>      <field name="n_macromolecule_name" type="text_lc_np_shingle"
>>             indexed="true" stored="true" multiValued="true"
>>             termVectors="true" termPositions="true" termOffsets="true"/>
>>      <field name="n_protein_family" type="text_lc_np_shingle"
>>             indexed="true" stored="true" multiValued="true"
>>             termVectors="true" termPositions="true" termOffsets="true"/>
>>      <field name="n_pathway_name" type="text_lc_np_shingle"
>>             indexed="true" stored="true" multiValued="true"
>>             termVectors="true" termPositions="true" termOffsets="true"/>
>>      <field name="n_cellreg_regulated_by" type="text_lc_np_shingle"
>>             indexed="true" stored="true" multiValued="true"
>>             termVectors="true" termPositions="true" termOffsets="true"/>
>>      <field name="n_cellreg_disease" type="text_lc_np_shingle"
>>             indexed="true" stored="true" multiValued="true"
>>             termVectors="true" termPositions="true" termOffsets="true"/>
>>      <field name="n_macromolecule_summary" type="text_lc_np_shingle"
>>             indexed="true" stored="true" multiValued="true"
>>             termVectors="true" termPositions="true" termOffsets="true"/>
>> 
>> 
>> Note that all are both indexed and stored, multi-valued, and I have  
>> termVectors="true" termPositions="true" termOffsets="true" to enable FVH. 
>> When I had missed that in a field, I could see the log indicating such and 
>> reverting to the regular highlighter. I no longer see those messages.  All 
>> of the above fields are of this type:
>> 
>>       <!-- A text field that forces lowercase, removes punctuation and
>>            generates shingles for phrase matching -->
>>       <fieldType name="text_lc_np_shingle" class="solr.TextField"
>>                  positionIncrementGap="100">
>>         <analyzer type="index">
>>           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>           <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>               ignoreCase="true" expand="true"/>
>>           <!-- strip punctuation -->
>>           <filter class="solr.PatternReplaceFilterFactory"
>>               pattern="([\p{Punct}])" replacement="" replace="all"/>
>>           <!-- Remove any 0-length tokens. -->
>>           <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>>           <filter class="solr.LowerCaseFilterFactory"/>
>>           <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
>>               outputUnigrams="true"/>
>>         </analyzer>
>>         <analyzer type="query">
>>           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>           <!-- strip punctuation -->
>>           <filter class="solr.PatternReplaceFilterFactory"
>>               pattern="([\p{Punct}])" replacement="" replace="all"/>
>>           <!-- Remove any 0-length tokens. -->
>>           <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>>           <filter class="solr.LowerCaseFilterFactory"/>
>>           <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
>>               outputUnigrams="false" outputUnigramsIfNoShingles="true"/>
>>         </analyzer>
>>       </fieldType>
>> 
>> 
>> Using the standard highlight component, for the search term cancer (rows=2), 
>> I get the highlights I've come to appreciate:
>> 
>>    <lst name="highlighting">
>>        <lst name="ING:3lzx">
>>            <arr name="n_macromolecule_name">
>>                <str>&lt;span class="ingReasonText"&gt;cancer&lt;/span&gt; 
>> susceptibility candidate 1</str>
>>            </arr>
>>            <arr name="n_protein_family">
>>                <str>&lt;span class="ingReasonText"&gt;Cancer&lt;/span&gt; 
>> susceptibility candidate 1</str>
>>            </arr>
>>        </lst>
>>        <lst name="ING:8lj">
>>            <arr name="n_macromolecule_name">
>>                <str>breast &lt;span 
>> class="ingReasonText"&gt;cancer&lt;/span&gt; 2, early onset</str>
>>            </arr>
>>            <arr name="n_pathway_name">
>>                <str>Hereditary Breast &lt;span 
>> class="ingReasonText"&gt;Cancer&lt;/span&gt; Signaling</str>
>>            </arr>
>>            <arr name="n_cellreg_regulated_by">
>>                <str>prostate &lt;span 
>> class="ingReasonText"&gt;cancer&lt;/span&gt; cells</str>
>>            </arr>
>>            <arr name="n_cellreg_disease">
>>                <str>breast &lt;span 
>> class="ingReasonText"&gt;cancer&lt;/span&gt;</str>
>>            </arr>
>>            <arr name="n_macromolecule_summary">
>>                <str> mutations in BRCA1 and this gene, BRCA2, confer 
>> increased lifetime risk of developing breast or ovarian &lt;span 
>> class="ingReasonText"&gt;cancer.&lt;/span&gt;</str>
>>            </arr>
>>        </lst>
>>    </lst>
>> 
>> With everything else being the same, when I set 
>> hl.useFastVectorHighlighter=true I get:
>> 
>>    <lst name="highlighting">
>>        <lst name="ING:3lzx"/>
>>        <lst name="ING:8lj">
>>            <arr name="n_macromolecule_summary">
>>                <str>breast or &lt;span 
>> class="ingReasonText"&gt;ovarian&lt;/span&gt; cancer. Both BRCA1 and BRCA2 
>> are involved in maintenance of genome stability, specifically</str>
>>            </arr>
>>        </lst>
>>    </lst>
>> 
>> Note that the same fields simply do not appear, except for 
>> n_macromolecule_summary, where for some reason "ovarian" is highlighted 
>> instead of "cancer".
>> 
>> Highlight related configuration is in the edismax request handler:
>> 
>>     <str name="hl.requireFieldMatch">true</str>
>>     <str name="hl.usePhraseHighlighter">true</str>
>>     <str name="hl.phraseLimit">5000</str>
>>     <str name="hl.fragListBuilder">simple</str>
>>     <str name="hl.fragmentsBuilder">colored</str>
>>     <str name="hl.simple.pre"><![CDATA[<span class="ingReasonText">]]></str>
>>     <str name="hl.simple.post"><![CDATA[</span>]]></str>
>>     <str name="hl.tag.pre"><![CDATA[<span class="ingReasonText">]]></str>
>>     <str name="hl.tag.post"><![CDATA[</span>]]></str>
>> 
>>     <!-- for this field, we want no fragmenting, just highlighting -->
>>     <str name="f.name.hl.fragsize">0</str>
>>     <!-- instructs Solr to return the field itself if no query terms are
>>          found
>>     <str name="f.name.hl.alternateField">name</str> -->
>>     <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
>> 
>> Any ideas on what I'm doing wrong?  Sorry for the long email, but I'm trying 
>> to answer as many anticipated configuration questions as I can. Is there a 
>> problem with FVH and shingling?  Hopefully it's something else?
>> 
>> Thanks,
>> 
>> Jeff
>> --
>> Jeff Schmidt
>> 535 Consulting
>> j...@535consulting.com
>> http://www.535consulting.com
>> (650) 423-1068
> 
> --
> Jeff Schmidt
> jeff_schm...@mac.com
>
