This does not appear to be shingle-specific. A non-shingled field also fails to highlight in the same way with FVH. I can see in the timing information that it takes much longer to run FVH than no highlighting at all, so Solr must be doing something. But why it lists only the document IDs, with little or no field highlighting, is still a mystery.
Any ideas on where I should look in the configuration, parameters to try, etc.?

Cheers,

Jeff

On Apr 19, 2012, at 7:51 AM, Jeff Schmidt wrote:

> I am using Solr 4.0, and debug=timing shows Solr spending the great majority
> of its time in the HighlightComponent. It seemed logical to look into the
> FastVectorHighlighter. It does seem much faster, but on the other hand, I'm
> not getting the highlights I need. :)
>
> I've seen references to FVH not supporting MultiTerm and (non-fixed sized)
> ngrams. I'm using edismax, and I don't know if a certain configuration of
> that becomes multi term and that's my problem, or if there is something
> completely different. I don't have ngrams, but I do shingle. For the
> examples below, I have these fields defined:
>
>   <field name="n_macromolecule_name" type="text_lc_np_shingle"
>          indexed="true" stored="true" multiValued="true" termVectors="true"
>          termPositions="true" termOffsets="true" />
>   <field name="n_protein_family" type="text_lc_np_shingle" indexed="true"
>          stored="true" multiValued="true" termVectors="true" termPositions="true"
>          termOffsets="true" />
>   <field name="n_pathway_name" type="text_lc_np_shingle" indexed="true"
>          stored="true" multiValued="true" termVectors="true" termPositions="true"
>          termOffsets="true" />
>   <field name="n_cellreg_regulated_by" type="text_lc_np_shingle"
>          indexed="true" stored="true" multiValued="true" termVectors="true"
>          termPositions="true" termOffsets="true" />
>   <field name="n_cellreg_disease" type="text_lc_np_shingle"
>          indexed="true" stored="true" multiValued="true" termVectors="true"
>          termPositions="true" termOffsets="true" />
>   <field name="n_macromolecule_summary" type="text_lc_np_shingle"
>          indexed="true" stored="true" multiValued="true" termVectors="true"
>          termPositions="true" termOffsets="true"/>
>
> Note that all are both indexed and stored, multi-valued, and I have
> termVectors="true" termPositions="true" termOffsets="true" to enable FVH.
> When I had missed that in a field, I could see the log indicating such and
> reverting to the regular highlighter. I no longer see those messages. All of
> the above fields are of this type:
>
>   <!-- A text field that forces lowercase, removes punctuation and
>        generates shingles for phrase matching -->
>   <fieldType name="text_lc_np_shingle" class="solr.TextField"
>              positionIncrementGap="100">
>     <analyzer type="index">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>               ignoreCase="true" expand="true"/>
>       <!-- strip punctuation -->
>       <filter class="solr.PatternReplaceFilterFactory"
>               pattern="([\p{Punct}])" replacement="" replace="all"/>
>       <!-- Remove any 0-length tokens. -->
>       <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
>               outputUnigrams="true" />
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <!-- strip punctuation -->
>       <filter class="solr.PatternReplaceFilterFactory"
>               pattern="([\p{Punct}])" replacement="" replace="all"/>
>       <!-- Remove any 0-length tokens. -->
>       <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
>               outputUnigrams="false" outputUnigramsIfNoShingles="true"/>
>     </analyzer>
>   </fieldType>
>
> Using the standard highlight component, for the search term cancer (rows=2),
> I get the highlights I've come to appreciate:
>
>   <lst name="highlighting">
>     <lst name="ING:3lzx">
>       <arr name="n_macromolecule_name">
>         <str><span class="ingReasonText">cancer</span> susceptibility candidate 1</str>
>       </arr>
>       <arr name="n_protein_family">
>         <str><span class="ingReasonText">Cancer</span> susceptibility candidate 1</str>
>       </arr>
>     </lst>
>     <lst name="ING:8lj">
>       <arr name="n_macromolecule_name">
>         <str>breast <span class="ingReasonText">cancer</span> 2, early onset</str>
>       </arr>
>       <arr name="n_pathway_name">
>         <str>Hereditary Breast <span class="ingReasonText">Cancer</span> Signaling</str>
>       </arr>
>       <arr name="n_cellreg_regulated_by">
>         <str>prostate <span class="ingReasonText">cancer</span> cells</str>
>       </arr>
>       <arr name="n_cellreg_disease">
>         <str>breast <span class="ingReasonText">cancer</span></str>
>       </arr>
>       <arr name="n_macromolecule_summary">
>         <str> mutations in BRCA1 and this gene, BRCA2, confer increased lifetime risk of developing breast or ovarian <span class="ingReasonText">cancer.</span></str>
>       </arr>
>     </lst>
>   </lst>
>
> With everything else being the same, when I set
> hl.useFastVectorHighlighter=true I get:
>
>   <lst name="highlighting">
>     <lst name="ING:3lzx"/>
>     <lst name="ING:8lj">
>       <arr name="n_macromolecule_summary">
>         <str>breast or <span class="ingReasonText">ovarian</span> cancer. Both BRCA1 and BRCA2 are involved in maintenance of genome stability, specifically</str>
>       </arr>
>     </lst>
>   </lst>
>
> Note that the same fields simply do not appear, except for
> n_macromolecule_summary, in which case it's for some reason highlighting
> "ovarian" instead of "cancer".
>
> Highlight related configuration is in the edismax request handler:
>
>   <str name="hl.requireFieldMatch">true</str>
>   <str name="hl.usePhraseHighlighter">true</str>
>   <str name="hl.phraseLimit">5000</str>
>   <str name="hl.fragListBuilder">simple</str>
>   <str name="hl.fragmentsBuilder">colored</str>
>   <str name="hl.simple.pre"><![CDATA[<span class="ingReasonText">]]></str>
>   <str name="hl.simple.post"><![CDATA[</span>]]></str>
>   <str name="hl.tag.pre"><![CDATA[<span class="ingReasonText">]]></str>
>   <str name="hl.tag.post"><![CDATA[</span>]]></str>
>
>   <!-- for this field, we want no fragmenting, just highlighting -->
>   <str name="f.name.hl.fragsize">0</str>
>   <!-- instructs Solr to return the field itself if no query terms are found
>   <str name="f.name.hl.alternateField">name</str> -->
>   <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
>
> Any ideas on what I'm doing wrong? Sorry for the long email, but I'm trying
> to answer as many anticipated configuration questions as I can. Is there a
> problem with FVH and shingling? Hopefully it's something else?
>
> Thanks,
>
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> j...@535consulting.com
> http://www.535consulting.com
> (650) 423-1068

--
Jeff Schmidt
jeff_schm...@mac.com