(Edge)NGramFilterFactory and highlight

Bjørn Hjelle Fri, 19 Dec 2014 06:29:56 -0800

Hi,

Based on this example:
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
I have previously implemented highlighting of terms in
(Edge)NGram-analyzed fields successfully.


In a new project, however, with Solr 4.10.2 it does not work.

In the Solr admin analysis page I see the following in Solr 4.10.2 (simplified):

ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    4  4   4    4

But if I change luceneMatchVersion to LUCENE_43 in solrconfig.xml and
reload the analysis page, I get this:

ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    1  2   3    4

So in 4.10.2 every n-gram gets the end offset of the complete token
(4) rather than its own, and the highlighter therefore highlights the
complete word ("test" in this case) instead of only the matched prefix.
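
To make the effect on highlighting concrete, here is a minimal Python
sketch. It only models the offsets from the two analysis outputs above
and is not Lucene's actual code; the function names edge_ngrams and
highlight, and the per_gram_offsets flag, are made up for illustration.

```python
# Illustrative model of the offsets seen on the analysis page.
# This is NOT Lucene code; all names here are hypothetical.

def edge_ngrams(token, start, min_gram=1, max_gram=20, per_gram_offsets=True):
    """Return (text, start_offset, end_offset) for each edge n-gram of token.

    per_gram_offsets=True  mimics the LUCENE_43 output (end = 1, 2, 3, 4).
    per_gram_offsets=False mimics the 4.10.2 output   (end = 4, 4, 4, 4).
    """
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        end = start + size if per_gram_offsets else start + len(token)
        grams.append((token[:size], start, end))
    return grams


def highlight(text, query, grams):
    """Wrap the offset span of the first gram equal to query in <em> tags."""
    for gram, start, end in grams:
        if gram == query:
            return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]
    return text


old_style = edge_ngrams("test", 0, per_gram_offsets=True)
new_style = edge_ngrams("test", 0, per_gram_offsets=False)

print(highlight("test", "te", old_style))  # <em>te</em>st
print(highlight("test", "te", new_style))  # <em>test</em>
```

With per-gram offsets a query for "te" marks only the prefix; with the
whole-token offsets the same query marks the complete word.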


To reproduce this:
1. Download Solr 4.10.2.
2. In the collection1 schema.xml, add this field type:


        <fieldType name="autocomplete_ngram" class="solr.TextField">
            <analyzer type="index">
                <charFilter class="solr.MappingCharFilterFactory"
                            mapping="mapping-ISOLatin1Accent.txt"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="1" generateNumberParts="1"
                        catenateWords="0" catenateNumbers="0"
                        catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EdgeNGramFilterFactory"
                        maxGramSize="20" minGramSize="1"/>
                <filter class="solr.PatternReplaceFilterFactory"
                        pattern="([^\w\d\*æøåÆØÅ ])" replacement=""
                        replace="all"/>
            </analyzer>
            <analyzer type="query">
                <charFilter class="solr.MappingCharFilterFactory"
                            mapping="mapping-ISOLatin1Accent.txt"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="0" generateNumberParts="0"
                        catenateWords="0" catenateNumbers="0"
                        catenateAll="0" splitOnCaseChange="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.PatternReplaceFilterFactory"
                        pattern="([^\w\d\*æøåÆØÅ ])" replacement=""
                        replace="all"/>
                <filter class="solr.PatternReplaceFilterFactory"
                        pattern="^(.{20})(.*)?" replacement="$1"
                        replace="all"/>
            </analyzer>
        </fieldType>

3. Start Solr and, on the analysis page, enter "Test" in the
Field Value (Index) field and check the output.
4. Then change the Lucene match version in solrconfig.xml to this:

  <luceneMatchVersion>LUCENE_43</luceneMatchVersion>

5. Reload the core and reload the analysis page.
6. You will now see that the end offsets are correct.
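
The analysis-page check in steps 3 to 6 can probably also be scripted.
This is a minimal sketch, assuming the /analysis/field handler
(FieldAnalysisRequestHandler) from the example solrconfig.xml is
enabled and Solr listens on localhost:8983. The helper name
analysis_url is hypothetical.

```python
from urllib.parse import urlencode


def analysis_url(base, field_type, value):
    """Build a field-analysis request URL for one field type and value.

    Assumes the core exposes the /analysis/field request handler.
    """
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": value,
        "wt": "json",
    }
    return base.rstrip("/") + "/analysis/field?" + urlencode(params)


# Assumed host, port and core name; adjust to the local installation.
url = analysis_url("http://localhost:8983/solr/collection1",
                   "autocomplete_ngram", "Test")
print(url)
```

Fetching this URL before and after the luceneMatchVersion change
should show the same start and end values as the analysis page.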



Any ideas on how to make this work with Solr 4.10.2 without resorting
to changing the Lucene version in solrconfig.xml?


Thanks,
Bjørn
