AW: Antw: Re: Correct order of mappinCharFilter, Tokenizer and GermanStemFilter

Tobias Ibounig Sat, 20 Jul 2019 00:01:02 -0700

Well, the stemmer only works with generalized rules, and these rules sometimes
(or surprisingly often) do not produce the actual word stem. The question is
how much that matters in your case. When you search for something, the same
transformations are applied to the query. So whether you search for "München",
"Muenchen" or "Munchen", you will always get the result.
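
To illustrate why the three spellings converge, here is a minimal Python sketch. It assumes the analysis chain contains folding along the lines documented for GermanNormalizationFilter (ß to ss, umlauts to base vowels, "ae"/"oe" to "a"/"o", "ue" to "u" unless it follows a vowel or q). The function normalize_german is a hypothetical stand-in written for this note, not Lucene code, and it only approximates those rules.

```python
import re

def normalize_german(token: str) -> str:
    """Toy approximation of German umlaut folding (NOT Lucene's implementation)."""
    t = token.lower()
    t = t.replace("ß", "ss")
    t = t.replace("ä", "a").replace("ö", "o").replace("ü", "u")
    t = t.replace("ae", "a").replace("oe", "o")
    # Fold 'ue' to 'u' only when it does not follow a vowel or 'q'
    # (so words like "mauer" or "quelle" are left alone).
    t = re.sub(r"(?<![aeiouyq])ue", "u", t)
    return t

# The same function is applied to indexed text and to the query,
# so all three spellings end up as the same term.
indexed_terms = {normalize_german(w) for w in ["München", "Muenchen", "Munchen"]}
query_term = normalize_german("München")
print(indexed_terms, query_term in indexed_terms)
```

Because index-time and query-time analysis run the same steps, it does not matter that the resulting term is not a "correct" German word, only that both sides agree on it.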

If you want less aggressive stemming, you can use the Light or Minimal stemmer
variants, or you can exclude a list of words from stemming by putting the
KeywordMarkerFilter before the stemmer.
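
The reason the marker has to come before the stemmer can be sketched in a few lines of Python. This is a toy model, not Lucene code: mark_keywords, toy_stem and stem_filter are hypothetical names, and toy_stem is a deliberately crude suffix stripper that does not reproduce what any German stemmer in Lucene returns. The only point is that a token flagged as a keyword passes through the stemming step unchanged.

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, bool]  # (text, is_keyword)

def mark_keywords(tokens: Iterable[str], protected: Set[str]) -> List[Token]:
    """Flag tokens that appear in the protected word list."""
    return [(t, t in protected) for t in tokens]

def toy_stem(word: str) -> str:
    """Crude suffix stripping, for illustration only."""
    for suffix in ("ern", "er", "en", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_filter(tokens: Iterable[Token]) -> List[str]:
    """Stem every token except those flagged as keywords."""
    return [text if is_keyword else toy_stem(text) for text, is_keyword in tokens]

tokens = ["hauser", "kinder"]
print(stem_filter(mark_keywords(tokens, protected={"hauser"})))
print(stem_filter(mark_keywords(tokens, protected=set())))
```

If the marking step ran after the stemmer, the flag would arrive too late and the protected word would already have been modified.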

All the Best
Tobias
________________________________
From: Doris Peter <[email protected]>
Sent: Friday, 19 July 2019 13:48:14
To: [email protected] <[email protected]>
Subject: RE: Antw: Re: Correct order of mappinCharFilter, Tokenizer and 
GermanStemFilter

Yes, you are right, we should discuss this once more ...
But we have texts that contain e.g. "Muenchen", and we would like to retrieve
these documents too when searching for "München". We would lose them if we
mapped 'München' to 'Munchen'.
On the other hand, we run into trouble with the wildcard '?' when we map ü to
ue :-(
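
The wildcard problem can be reproduced outside Solr. The sketch below uses Python's fnmatch module purely as a stand-in for single-character wildcard matching against indexed terms; it is an assumption that this models the relevant behaviour closely enough, and the example terms are made up for this note.

```python
from fnmatch import fnmatchcase

pattern = "m?nchen"  # '?' stands for exactly one character

# Term as indexed with no mapping, with ü -> u, and with ü -> ue
for term in ["münchen", "munchen", "muenchen"]:
    print(term, fnmatchcase(term, pattern))

# With the ü -> ue mapping the user would have to know to type two wildcards
print(fnmatchcase("muenchen", "m??nchen"))
```

Mapping one character to two changes the length of the indexed term, so a pattern that a user writes against the original spelling no longer lines up with what is stored in the index.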

Anyway, I tried it without any mapping, and the GermanStemFilterFactory still
doesn't work as expected: it turns 'häuser' into 'hau', not into 'haus' :-/

>>> Tobias Ibounig <[email protected]> 7/19/2019 11:54 AM >>>
Hi Doris,

Are you sure you want 'ä' --> 'ae'?
If you check, the German stemmers usually substitute ä --> a (to "reduce over 
stemming" [1]), so you would be working against the stemmer's logic here.

If you take a look at the GermanNormalizationFilter, it even substitutes 'ae' 
with 'a' [2].

I would recommend using the default available tools unless you have a specific
requirement that rules them out.

All the Best
Tobias

[1] 
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java#L164

[2] 
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanNormalizationFilter.java#L31

-----Original Message-----
From: Doris Peter <[email protected]>
Sent: Friday, 19 July 2019 11:13
To: [email protected]
Subject: Antw: Re: Correct order of mappinCharFilter, Tokenizer and 
GermanStemFilter

Thanks for the answer. I examined the ICUFoldingFilterFactory, but it seems to
me that it can't be customized the way I would need.
We have some special foldings, e.g. ä->ae. With the CharFilter, I can add them
to the mapping file configured via mapping="mapping-FoldToASCII.txt".
There seems to be nothing like this mapping file for the ICUFoldingFilter?
Exclusion is not enough ...

>>> Shawn Heisey <[email protected]> 7/18/2019 3:08 PM >>>
On 7/18/2019 3:01 AM, Doris Peter wrote:
> So, the MappingCharFilter seems to be executed first, no matter which 
> position it has in the configuration?

CharFilters are always executed first.  Then one Tokenizer, then Filters.  This 
will always be the case, even if you order the config so that the Tokenizer and 
one or more Filters are listed before CharFilter entries.  It's one of the 
quirks of analysis definitions.
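
As a rough illustration of that ordering rule, here is a toy Python model of an analysis chain. It is not how Solr is implemented; the analyze function and the "charFilter" / "tokenizer" / "filter" labels are assumptions made for this sketch. Components are grouped by kind before anything runs, so the position of a char filter in the listing has no effect on the result.

```python
from typing import Callable, List, Tuple

Component = Tuple[str, Callable]  # (kind, function)

def analyze(text: str, config: List[Component]) -> List[str]:
    """Run char filters first, then the single tokenizer, then token filters."""
    char_filters = [f for kind, f in config if kind == "charFilter"]
    tokenizers = [f for kind, f in config if kind == "tokenizer"]
    filters = [f for kind, f in config if kind == "filter"]
    if len(tokenizers) != 1:
        raise ValueError("exactly one tokenizer is required")
    for char_filter in char_filters:   # operate on the raw character stream
        text = char_filter(text)
    tokens = tokenizers[0](text)       # split the text into tokens
    for token_filter in filters:       # operate on tokens, in listed order
        tokens = token_filter(tokens)
    return tokens

map_umlaut = ("charFilter", lambda s: s.replace("ä", "ae"))
whitespace = ("tokenizer", str.split)
lowercase = ("filter", lambda toks: [t.lower() for t in toks])

listed_first = [map_umlaut, whitespace, lowercase]
listed_last = [whitespace, lowercase, map_umlaut]
print(analyze("Häuser Bäume", listed_first))
print(analyze("Häuser Bäume", listed_last))
```

Both listings produce the same tokens, which matches the behaviour described above: only the relative order among token filters is controlled by the configuration.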

The fix for this would be to see if there is a regular Filter that does what 
the CharFilter you're using does and use that filter instead.

If it were me, I would likely use ICUFoldingFilterFactory rather than 
MappingCharFilterFactory.  The ICU analysis components do require installing 
contrib jars into Solr.

https://lucene.apache.org/solr/guide/8_1/filter-descriptions.html#icu-folding-filter

Thanks,
Shawn
