Jimi:

The critical line for the KeywordRepeatFilter is "This is useful if used with a stem filter that respects the KeywordAttribute to index the stemmed and the un-stemmed version of a term into the same field." There is no guarantee that every filter _after_ the KeywordRepeatFilter respects the keyword attribute, and judging by your analysis output, ICUFoldingFilter does not.
Respecting the attribute doesn't seem like it would be difficult to add, though, if you'd like to submit a patch.

But that's not the most important bit. Have you considered something like MappingCharFilterFactory? Unfortunately that's a charFilter, which transforms everything before it gets to the KeywordRepeatFilter, so you'd have to use two fields.

Best,
Erick

On Tue, Jan 10, 2017 at 1:02 AM, <jimi.hulleg...@svensktnaringsliv.se> wrote:
> Hi,
>
> I wasn't happy with how our current Solr configuration handled diacritics
> (like 'é') in the text and in search queries, since it simply treated a
> letter with a diacritic as a distinct letter. I.e. 'é' didn't match 'e', and
> vice versa. Except for a handful of rare words where the diacritic in 'é'
> actually changes the meaning of the word, it is usually used in names of
> people and places, and the expected behavior when searching is that one
> doesn't have to type the diacritics and still gets the expected results
> (like searching for 'Penelope Cruz' and getting hits for 'Penélope Cruz').
>
> When reading online about how to handle diacritics in Solr, it seems that the
> general recommendation, when no language-specific solution exists that
> handles this, is to use the ICUFoldingFilter. However, this filter doesn't
> really come with a lot of documentation, and doesn't seem to have any
> configuration options at all (at least none that are documented).
>
> So what I ended up doing was simply to add the ICUFoldingFilterFactory
> in the middle of the existing analyzer chain, like this:
>
> <fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" " />
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.KeywordRepeatFilterFactory" />
>     <filter class="solr.ICUFoldingFilterFactory"/>
>     <filter class="solr.SwedishLightStemFilterFactory" />
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>   </analyzer>
> </fieldType>
>
> But that didn't really give me the results I want. For example, using the
> analysis debug tool I see that the text 'café åäö' becomes 'cafe caf aao'.
> And there are two problems with that result:
>
> 1. It doesn't respect the keyword attribute
> 2. It folds the Swedish characters 'åäö' into 'aao'
>
> The disregard of the keyword attribute is bad enough, but the mangling of the
> Swedish language is really a show stopper for us. The Swedish language
> doesn't consider 'ö', for example, to be the letter 'o' with two diacritical
> dots above it, just as 'Q' isn't considered to be the letter 'O' with a
> diacritical "squiggly line" at the bottom. So when handling Swedish text,
> these characters ('åäöÅÄÖ') shouldn't be folded, because then there will be
> too many "collisions".
>
> For example, when searching for 'påstå' ('claim'), one doesn't want hits
> about 'pasta' (you guessed it, it means 'pasta'), just as one doesn't want to
> get hits about 'aga' ('corporal punishment, usually against children') when
> searching for 'äga' ('to own'). Or even worse, when searching for 'höra' ('to
> hear'), one most likely doesn't want hits about 'hora' ('prostitute'). And I
> can go on... :)
>
> So, is there a way for us to make the ICUFoldingFilter work in a better way?
> I.e. configure it to respect the keyword attribute and ignore the 'åäö'
> characters when folding, but otherwise fold all characters with diacritics
> into their non-diacritical form? Or how would you recommend that we configure
> our analyzer chain to accomplish this?
>
> Regards
> /Jimi
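
P.S. For what it's worth, here is a rough, untested sketch of the two-field idea with MappingCharFilterFactory. Everything that isn't in your original chain is an assumption on my part: the type name text_sv_folded, the field names body and body_folded, and the mapping file mapping-fold-sv.txt, which you would have to write yourself. That file would list only the characters you want folded (lines of the form "é" => "e") and simply leave out å, ä and ö so they pass through untouched. Since the charFilter runs before LowerCaseFilterFactory, uppercase variants such as "É" would need their own entries. The sketch also assumes your existing text_sv type goes back to what it was before you added ICUFoldingFilterFactory.

```xml
<!-- Hypothetical sketch: text_sv_folded, body, body_folded and
     mapping-fold-sv.txt are illustrative names, not existing config. -->
<fieldType name="text_sv_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" " />
    <!-- Folds only the characters listed in the (hand-written) mapping file;
         å, ä and ö are not listed, so they are left alone. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-fold-sv.txt" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- Same stemmed + un-stemmed trick as in the original chain. -->
    <filter class="solr.KeywordRepeatFilterFactory" />
    <filter class="solr.SwedishLightStemFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

<!-- Unfolded field uses the existing text_sv type (without ICU folding). -->
<field name="body" type="text_sv" indexed="true" stored="true" />
<!-- Folded copy, only searched, never displayed. -->
<field name="body_folded" type="text_sv_folded" indexed="true" stored="false" />
<copyField source="body" dest="body_folded" />
```

You would then search both fields, for example with edismax and something like qf=body^2 body_folded, so that matches on the unfolded text rank above matches that only hit the folded copy. The boost value is just a placeholder to tune.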