Re: Highlight with NGram and German S Sharp "ß"

Scott Stults Tue, 20 Oct 2015 13:57:07 -0700

Yep, I misunderstood the problem.

The multiple tokens at the same offset might be messing things up. One
thing you can do is copyField into a field that doesn't have n-grams and set
something like f.textng.hl.alternateField= in your solrconfig. That'll use
the other field during highlighting. The downside is that it'll increase
your index size on disk.
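
A rough sketch of what that could look like, in case it helps. The field
name text_hl and the type text_general are placeholders I made up for
illustration (they are not from your schema), and the defaults block would go
inside whichever requestHandler you query. As far as I know, hl.alternateField
is documented as a fallback that is returned when no snippet can be generated
for the field, so it may also be worth trying the copy directly in hl.fl.

```xml
<!-- schema.xml: hypothetical non-n-gram copy of textng, kept only for highlighting.
     "text_hl" and "text_general" are assumed names; use a type without
     EdgeNGramFilterFactory in its analyzer chain. -->
<field name="text_hl" type="text_general" indexed="true" stored="true"/>
<copyField source="textng" dest="text_hl"/>

<!-- solrconfig.xml: defaults of the requestHandler you search with (assumed). -->
<lst name="defaults">
  <str name="hl">true</str>
  <str name="hl.fl">textng</str>
  <!-- Per-field override: fall back to the non-n-gram copy for textng. -->
  <str name="f.textng.hl.alternateField">text_hl</str>
</lst>
```

The copy has to be stored for the highlighter to return it, which is where
the extra index size comes from.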




On Fri, Oct 16, 2015 at 10:07 AM, Jérôme Bernardes <
jerome.bernar...@mappy.com> wrote:

> Thanks for your reply Scott.
>
> I tried
>
> bs.language=de&bs.country=de
>
> Unfortunately the problem still occurs.
> I have just discovered that the problem affects not only "ß" but also
> "æ" (which is mapped to "ae" at query and index time):
> q=hae   -->   <em>hæna</em>
> So it seems to me that the problem is related to any single character that
> is mapped to several characters using <charFilter
> class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>
> Jérôme
>
>
> On 13/10/2015 07:46, Scott Stults wrote:
>
>> My guess is that the boundary scanner isn't configured right for your
>> highlighter. Try setting the bs.language and bs.country parameters either
>> in your request or in the requestHandler.
>>
>>
>> k/r,
>> Scott
>>
>> On Mon, Oct 5, 2015 at 4:57 AM, Jérôme Bernardes <
>> jerome.bernar...@mappy.com> wrote:
>>
>>> Dear Solr Users,
>>> I am facing a problem with highlighting on ngram fields.
>>> Highlighting is working well, except for words with the German character
>>> "ß".
>>> E.g. with q=rosen&
>>> "highlighting": {
>>>          "gcl3r:12723710:6643": {
>>>              "textng": [
>>>                  "<em>Rosen</em>steinpark (Métro), Stuttgart (Allemagne)"
>>>              ]
>>>          },
>>>          "gcl3r:2267495:780930": {
>>>              "textng": [
>>>                  "<em>Rosenstraße</em>, 94554 Moos (Allemagne)"
>>>              ]
>>>          }
>>>      }
>>> Without "ß", words are highlighted partially (<em>Rosen</em>steinpark), but
>>> with "ß", the whole word is highlighted (<em>Rosenstraße</em>).
>>>
>>> -------------
>>> The character ß is mapped to "ss" at query and index time (using
>>> <charFilter class="solr.MappingCharFilterFactory"
>>> mapping="mapping-ISOLatin1Accent.txt"/>).
>>>
>>> Here is the schema.xml for the highlighted field.
>>> <fieldType name="autocomplete_ngram" class="solr.TextField">
>>>    <analyzer type="index">
>>>      <charFilter class="solr.MappingCharFilterFactory"
>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>      <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
>>>                  <tokenizer class="solr.PatternTokenizerFactory"
>>> pattern="[\s,;:
>>> \-\']"/>
>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>          splitOnNumerics="0"
>>>          generateWordParts="1"
>>>          generateNumberParts="1"
>>>          catenateWords="0"
>>>          catenateNumbers="0"
>>>          catenateAll="0"
>>>          splitOnCaseChange="1"
>>>          preserveOriginal="1"
>>>          types="wdfftypes.txt"
>>>          />
>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>      <filter class="solr.SynonymFilterFactory" synonyms="synonym.txt"
>>> ignoreCase="true" expand="true"/>
>>>      <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20"
>>> minGramSize="1"/>
>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d
>>> \*&æøåÆØÅ ])" replacement="" replace="all"/>
>>>    </analyzer>
>>>    <analyzer type="query">
>>>      <charFilter class="solr.MappingCharFilterFactory"
>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>      <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
>>>                  <tokenizer class="solr.PatternTokenizerFactory"
>>> pattern="[\s,;:
>>> \-\']"/>
>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>          splitOnNumerics="0"
>>>          generateWordParts="1"
>>>          generateNumberParts="0"
>>>          catenateWords="0"
>>>          catenateNumbers="0"
>>>          catenateAll="0"
>>>          splitOnCaseChange="0"
>>>          preserveOriginal="1"
>>>          types="wdfftypes.txt"
>>>          />
>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d
>>> \*&æøåÆØÅ ])" replacement="" replace="all"/>
>>>      <filter class="solr.PatternReplaceFilterFactory"
>>> pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>>>    </analyzer>
>>> </fieldType>
>>>
>>> Is it a problem in our configuration or a known bug?
>>> Regards
>>> Jérôme
>>>
>>>
>>>
>>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com
