Re: ICUFoldingFilter with swedish characters, and tokens with the keyword attribute?

Erick Erickson Tue, 10 Jan 2017 04:59:30 -0800

Jimi:

The critical line for the KeywordRepeatFilter is "This is useful if
used with a stem filter that respects the KeywordAttribute to index
the stemmed and the un-stemmed version of a term into the same
field.". There is no guarantee that all filters _after_ the
KeywordRepeatFilter respect the keyword attribute.


Doesn't seem like it would be difficult to add if you'd like to submit
a patch though.

But that's not the most important bit. Have you considered something
like MappingCharFilterFactory? Unfortunately that's a charFilter, which
transforms everything before it gets to the KeywordRepeatFilter, so
you'd have to use two fields.
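
A rough sketch of what the second field could look like, in case it
helps. The names text_sv_folded, body, body_folded and
mapping-sv-fold.txt are made up for illustration; the mapping file is
one you'd write yourself, listing only the characters you want folded
and leaving out å, ä and ö. I haven't run this against your schema, so
treat it as a starting point rather than a tested config.

```xml
<!-- Hypothetical second field type: folds diacritics with a hand-written
     mapping file before tokenizing. Example lines for mapping-sv-fold.txt
     (an assumed file name), deliberately omitting å, ä and ö:
       "é" => "e"
       "É" => "E"
       "ü" => "u"
     The charFilter runs before LowerCaseFilterFactory, so upper-case
     variants need their own entries. -->
<fieldType name="text_sv_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" "/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-sv-fold.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.SwedishLightStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field names: "body" keeps your existing unfolded analysis,
     "body_folded" receives the same text and applies the folded analysis. -->
<field name="body_folded" type="text_sv_folded" indexed="true" stored="false"/>
<copyField source="body" dest="body_folded"/>
```

You'd then search both fields (for example via qf with edismax), so a
query for 'Penelope' can match 'Penélope' through the folded field while
'påstå' and 'pasta' stay distinct in both.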

Best,
Erick

On Tue, Jan 10, 2017 at 1:02 AM,  <jimi.hulleg...@svensktnaringsliv.se> wrote:
> Hi,
>
> I wasn't happy with how our current Solr configuration handled diacritics 
> (like 'é') in the text and in search queries, since it simply considered the 
> letter with a diacritic as a distinct letter. I.e. 'é' didn't match 'e', and 
> vice versa. Except for a handful of rare words where the diacritical sign in 
> 'é' actually changes the meaning of the word, it is usually used in names of 
> people and places, and the expected behavior when searching is to not have to 
> type the diacritics and still get the expected results (like searching for 
> 'Penelope Cruz' and getting hits for 'Penélope Cruz').
>
> When reading online about how to handle diacritics in Solr, it seems that the 
> general recommendation, when no language-specific solution exists that 
> handles this, is to use the ICUFoldingFilter. However, this filter doesn't 
> really come with a lot of documentation, and doesn't seem to have any 
> configuration options at all (at least none that are documented).
>
> So what I ended up doing was simply to add the ICUFoldingFilterFactory 
> in the middle of the existing analyzer chain, like this:
>
> <fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" " />
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.KeywordRepeatFilterFactory" />
>     <filter class="solr.ICUFoldingFilterFactory"/>
>     <filter class="solr.SwedishLightStemFilterFactory" />
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>   </analyzer>
> </fieldType>
>
>
> But that didn't really give me the results I want. For example, using the 
> analysis debug tool I see that the text 'café åäö' becomes 'cafe caf aao'. 
> And there are two problems with that result:
>
> 1. It doesn't respect the keyword attribute
> 2. It folds the Swedish characters 'åäö' into 'aao'
>
> The disregard of the keyword attribute is bad enough, but the mangling of the 
> Swedish language is really a show stopper for us. The Swedish language 
> doesn't consider 'ö', for example, to be the letter 'o' with two diacritical 
> dots above it, just as 'Q' isn't considered to be the letter 'O' with a 
> diacritical "squiggly line" at the bottom. So when handling Swedish text, 
> these characters ('åäöÅÄÖ') shouldn't be folded, because then there will be 
> too many "collisions".
>
> For example, when searching for 'påstå' ('claim'), one doesn't want hits 
> about 'pasta' (you guessed it, it means 'pasta'), just as one doesn't want to 
> get hits about 'aga' ('corporal punishment, usually against children') when 
> searching for 'äga' ('to own'). Or even worse, when searching för 'höra' ('to 
> hear'), one most likely doesn't want hits about 'hora' ('prostitute'). And I 
> can go on... :)
>
> So, is there a way for us to make the ICUFoldingFilter work in a better way? 
> I.e. configure it to respect the keyword attribute and ignore the 'åäö' 
> characters when folding, but otherwise fold all diacritical characters into 
> their non-diacritical form. Or how would you recommend we configure our 
> analyzer chain to accomplish this?
>
> Regards
> /Jimi
