Dear all,
I'm trying to get highlighting working. I'm almost there, but it's not
perfect yet...
Basically my documents have a title and a description.
I have two kinds of text fields:
text:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
and text_french_light :
<fieldType name="text_french_light" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I then define my fields the following way :
<field name="title" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
<field name="title_stemmed" type="text_french_light" indexed="true"
       stored="true" termVectors="true" termPositions="true"
       termOffsets="true"/>
<field name="title_stemmed_nonorms" type="text_french_light"
       indexed="true" stored="false" omitNorms="true"
       omitTermFreqAndPositions="true"/>
<field name="description" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
<field name="description_stemmed" type="text_french_light"
       indexed="true" stored="true" termVectors="true" termPositions="true"
       termOffsets="true"/>
<field name="description_stemmed_nonorms" type="text_french_light"
       indexed="true" stored="false" omitNorms="true"
       omitTermFreqAndPositions="true"/>
I have the following copyField directives :
<copyField source="title" dest="title_stemmed" />
<copyField source="title" dest="title_stemmed_nonorms" />
<copyField source="description" dest="description_stemmed" />
<copyField source="description" dest="description_stemmed_nonorms" />
I rely on the dismax query handler to tune relevancy.
I have two different search use cases:
- a "structured search" mode, where my query looks like:
  q="Term1 term2"&qf=my_category_field^1.0&hl.q=Word1 word2&mm=100%
- a "free-text search" mode, where my query looks like:
  q=Term1 term2&qf=title_stemmed_nonorms^1.0 description_stemmed_nonorms^0.5&mm=-40%
Shared query parameters are as follows:
defType=dismax&hl=on&hl.fl=title_stemmed description_stemmed&hl.useFastVectorHighlighter=true&hl.fragListBuilder=single
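For reference, the shared parameters above could equally be declared once as request handler defaults in solrconfig.xml instead of being sent with every request. This is only a minimal sketch: the handler name "/search" is hypothetical and not part of the configuration described here, and it assumes a fragListBuilder named "single" is registered in the highlighting component (as it is in the stock example solrconfig.xml).

```xml
<!-- Hypothetical handler; the name "/search" is an assumption for illustration -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Same values as the shared query parameters listed above -->
    <str name="defType">dismax</str>
    <str name="hl">on</str>
    <str name="hl.fl">title_stemmed description_stemmed</str>
    <str name="hl.useFastVectorHighlighter">true</str>
    <!-- Assumes a fragListBuilder named "single" is defined elsewhere -->
    <str name="hl.fragListBuilder">single</str>
  </lst>
</requestHandler>
```

With such defaults in place, each request would only need to carry the mode-specific parameters (q, qf, mm and, for the structured search, hl.q).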
For all use cases, the relevancy parameters work well and my results are
satisfying.
The trouble is with highlighting:
- in the "free-text search" mode, everything is fine: the query is not
a phrase query, and highlighted terms may differ from the query terms
(when stemming comes into play)
- in the "structured search" mode, I have less luck: the query is a
phrase query, so I rely on the hl.q parameter to get the highlighting I
need. However, the query given in hl.q isn't processed the way it should
be when highlighting the fields: query-time analysis does not seem to be
applied.
I can show this easily: I enter the query term that isn't highlighted
in the analysis.jsp page, obtain its stemmed version, and use that in the
hl.q parameter. I then see my highlighted terms as expected.
I suspect a bug around the handling of the highlighting query (hl.q) when
the fields to highlight have a custom analysis chain (especially when
stemming, word delimiters, and so on are involved).
I tried playing with hl.usePhraseHighlighter=true and
hl.highlightMultiTerm=true but that didn't help at all =D
I tried both the legacy highlighter and the FVH (FastVectorHighlighter),
but the same issue occurs with each.
The issue only occurs when relying on hl.q.
Thank you very much for any help,
--
Tanguy