Okay, my fault. I misunderstood the conditions under which DataStax Enterprise 2 re-indexes content. Although my field definitions were set properly to support FVH, I believe no position and offset data had actually been generated, which would explain the empty highlights.
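For reference, here is a minimal sketch of the kind of field definition FVH needs. The field name example_name is hypothetical and only for illustration; the type text_lc_np_shingle is the one from my schema. Term vectors are written at index time, so documents indexed before these attributes were set carry no position or offset data until they are re-indexed:

```xml
<!-- Hypothetical field name, for illustration only.
     FVH needs all three term vector attributes set to "true". -->
<field name="example_name" type="text_lc_np_shingle"
       indexed="true" stored="true" multiValued="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```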
After re-indexing my content, I am now getting highlights from FVH, and I'm getting them noticeably faster. But for some reason, the document ID is being highlighted. For example, for a given document using the old highlighter:

  <lst name="ING:6xwoe">
    <arr name="n_name">
      <str><span class="ingReasonText">egfr</span></str>
    </arr>
    <arr name="n_synonym">
      <str><span class="ingReasonText">egfr</span></str>
    </arr>
  </lst>

With FVH, I get:

  <lst name="ING:6xwoe">
    <arr name="n_name">
      <str><span class="ingReasonText">ING:</span>6xwoe egfr </str>
    </arr>
    <arr name="n_synonym">
      <str><span class="ingReasonText">ING:</span>6xwoe egfr </str>
    </arr>
  </lst>

Anybody ever seen that before?

Thanks,

Jeff

On Apr 23, 2012, at 1:26 PM, Jeffrey Schmidt wrote:

> This does not appear to be shingle specific. A non-shingled field is also
> NOT highlighted in the same manner with FVH. I can see in the timing
> information that it takes much longer to run FVH than no highlighting at all,
> so Solr must be doing something. But why it just lists the document IDs and
> little or no field highlights is still a mystery.
>
> Any ideas on where I should look in the configuration, parameters to try etc.?
>
> Cheers,
>
> Jeff
>
> On Apr 19, 2012, at 7:51 AM, Jeff Schmidt wrote:
>
>> I am using Solr 4.0, and debug=timing shows Solr spending the great majority
>> of its time in the HighlightComponent. It seemed logical to look into the
>> FastVectorHighlighter. It does seem much faster, but on the other hand, I'm
>> not getting the highlights I need. :)
>>
>> I've seen references to FVH not supporting MultiTerm and (non-fixed sized)
>> ngrams. I'm using edismax, and I don't know if a certain configuration of
>> that becomes multi term and that's my problem, or if there is something
>> completely different. I don't have ngrams, but I do shingle.
>> For the examples below, I have these fields defined:
>>
>>   <field name="n_macromolecule_name" type="text_lc_np_shingle"
>>          indexed="true" stored="true" multiValued="true" termVectors="true"
>>          termPositions="true" termOffsets="true" />
>>   <field name="n_protein_family" type="text_lc_np_shingle" indexed="true"
>>          stored="true" multiValued="true" termVectors="true" termPositions="true"
>>          termOffsets="true" />
>>   <field name="n_pathway_name" type="text_lc_np_shingle" indexed="true"
>>          stored="true" multiValued="true" termVectors="true" termPositions="true"
>>          termOffsets="true" />
>>   <field name="n_cellreg_regulated_by" type="text_lc_np_shingle"
>>          indexed="true" stored="true" multiValued="true" termVectors="true"
>>          termPositions="true" termOffsets="true" />
>>   <field name="n_cellreg_disease" type="text_lc_np_shingle"
>>          indexed="true" stored="true" multiValued="true" termVectors="true"
>>          termPositions="true" termOffsets="true" />
>>   <field name="n_macromolecule_summary" type="text_lc_np_shingle"
>>          indexed="true" stored="true" multiValued="true" termVectors="true"
>>          termPositions="true" termOffsets="true"/>
>>
>> Note that all are both indexed and stored, multi-valued, and I have
>> termVectors="true" termPositions="true" termOffsets="true" to enable FVH.
>> When I had missed that in a field, I could see the log indicating such and
>> reverting to the regular highlighter. I no longer see those messages.
>> All of the above fields are of this type:
>>
>>   <!-- A text field that forces lowercase, removes punctuation and
>>        generates shingles for phrase matching -->
>>   <fieldType name="text_lc_np_shingle" class="solr.TextField"
>>              positionIncrementGap="100">
>>     <analyzer type="index">
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>               ignoreCase="true" expand="true"/>
>>       <!-- strip punctuation -->
>>       <filter class="solr.PatternReplaceFilterFactory"
>>               pattern="([\p{Punct}])" replacement="" replace="all"/>
>>       <!-- Remove any 0-length tokens. -->
>>       <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
>>               outputUnigrams="true" />
>>     </analyzer>
>>     <analyzer type="query">
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <!-- strip punctuation -->
>>       <filter class="solr.PatternReplaceFilterFactory"
>>               pattern="([\p{Punct}])" replacement="" replace="all"/>
>>       <!-- Remove any 0-length tokens.
>>       -->
>>       <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
>>               outputUnigrams="false" outputUnigramsIfNoShingles="true"/>
>>     </analyzer>
>>   </fieldType>
>>
>> Using the standard highlight component, for the search term cancer (rows=2),
>> I get the highlights I've come to appreciate:
>>
>>   <lst name="highlighting">
>>     <lst name="ING:3lzx">
>>       <arr name="n_macromolecule_name">
>>         <str><span class="ingReasonText">cancer</span> susceptibility candidate 1</str>
>>       </arr>
>>       <arr name="n_protein_family">
>>         <str><span class="ingReasonText">Cancer</span> susceptibility candidate 1</str>
>>       </arr>
>>     </lst>
>>     <lst name="ING:8lj">
>>       <arr name="n_macromolecule_name">
>>         <str>breast <span class="ingReasonText">cancer</span> 2, early onset</str>
>>       </arr>
>>       <arr name="n_pathway_name">
>>         <str>Hereditary Breast <span class="ingReasonText">Cancer</span> Signaling</str>
>>       </arr>
>>       <arr name="n_cellreg_regulated_by">
>>         <str>prostate <span class="ingReasonText">cancer</span> cells</str>
>>       </arr>
>>       <arr name="n_cellreg_disease">
>>         <str>breast <span class="ingReasonText">cancer</span></str>
>>       </arr>
>>       <arr name="n_macromolecule_summary">
>>         <str> mutations in BRCA1 and this gene, BRCA2, confer
>>           increased lifetime risk of developing breast or ovarian <span class="ingReasonText">cancer.</span></str>
>>       </arr>
>>     </lst>
>>   </lst>
>>
>> With everything else being the same, when I set
>> hl.useFastVectorHighlighter=true I get:
>>
>>   <lst name="highlighting">
>>     <lst name="ING:3lzx"/>
>>     <lst name="ING:8lj">
>>       <arr name="n_macromolecule_summary">
>>         <str>breast or <span class="ingReasonText">ovarian</span> cancer.
>>           Both BRCA1 and BRCA2 are involved in maintenance of genome stability, specifically</str>
>>       </arr>
>>     </lst>
>>   </lst>
>>
>> Note that the same fields simply do not appear, except for
>> n_macromolecule_summary, in which case it's for some reason highlighting
>> "ovarian" instead of "cancer".
>>
>> Highlight-related configuration is in the edismax request handler:
>>
>>   <str name="hl.requireFieldMatch">true</str>
>>   <str name="hl.usePhraseHighlighter">true</str>
>>   <str name="hl.phraseLimit">5000</str>
>>   <str name="hl.fragListBuilder">simple</str>
>>   <str name="hl.fragmentsBuilder">colored</str>
>>   <str name="hl.simple.pre"><![CDATA[<span class="ingReasonText">]]></str>
>>   <str name="hl.simple.post"><![CDATA[</span>]]></str>
>>   <str name="hl.tag.pre"><![CDATA[<span class="ingReasonText">]]></str>
>>   <str name="hl.tag.post"><![CDATA[</span>]]></str>
>>
>>   <!-- for this field, we want no fragmenting, just highlighting -->
>>   <str name="f.name.hl.fragsize">0</str>
>>   <!-- instructs Solr to return the field itself if no query terms are found
>>   <str name="f.name.hl.alternateField">name</str> -->
>>   <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
>>
>> Any ideas on what I'm doing wrong? Sorry for the long email, but I'm trying
>> to answer as many anticipated configuration questions as I can. Is there a
>> problem with FVH and shingling? Hopefully it's something else?
>>
>> Thanks,
>>
>> Jeff
>> --
>> Jeff Schmidt
>> 535 Consulting
>> j...@535consulting.com
>> http://www.535consulting.com
>> (650) 423-1068
>
> --
> Jeff Schmidt
> jeff_schm...@mac.com