[GitHub] [lucene] thomasschuerger opened a new issue, #11733: Provide a version of GermanNormalizationFilter that uses a modified Umlaut mapping

GitBox Thu, 01 Sep 2022 11:02:32 -0700


thomasschuerger opened a new issue, #11733:
URL: https://github.com/apache/lucene/issues/11733


   ### Description
   
   The GermanNormalizationFilter applies the following mappings: ä/ae -> a, 
ö/oe -> o, ü/ue -> u and ß -> ss (plus some simple rules for when "ue" should 
not be converted to "u"). Folding umlauts to their base vowels like this is 
very uncommon in German. Instead, it is common to treat ä and ae, ö and oe, 
ü and ue, as well as ß and ss as equivalent (the ASCII versions are used 
where the non-ASCII characters are unavailable, e.g. on an English keyboard or 
when the system doesn't allow these characters). With the filter's current 
mapping, searching for "Uber" (the company) finds the frequent word "über", 
which is unexpected, because "u" and "ü" are (normally) not treated as 
equivalent.
   
   Therefore I would like to see a filter that normalizes German by mapping 
ä->ae, ö->oe, ü->ue and ß->ss, either by an additional parameter for 
GermanNormalizationFilter which switches to that mapping (the previous mapping 
should of course be the default), or by having a separate filter 
(GermanNormalizationFilter2?) with that mapping.
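
   To make the difference concrete, here is a minimal sketch of the two mappings as plain string functions. It is not Lucene code: the class and method names are hypothetical, the "current" variant is a simplification that only folds the umlaut characters (it omits the ae/oe/ue handling and the "ue" exceptions), and it assumes terms are already lowercased. Umlauts are written as Unicode escapes so the file compiles regardless of source encoding.

```java
public class UmlautMappingSketch {

    // Simplified stand-in for the current behaviour: each umlaut is folded
    // to its base vowel and sharp s becomes "ss".
    public static String foldToBaseVowel(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e4': out.append('a'); break;   // a-umlaut -> a
                case '\u00f6': out.append('o'); break;   // o-umlaut -> o
                case '\u00fc': out.append('u'); break;   // u-umlaut -> u
                case '\u00df': out.append("ss"); break;  // sharp s  -> ss
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Requested behaviour: each umlaut is expanded to its two-letter
    // ASCII spelling and sharp s becomes "ss".
    public static String expandToDigraph(String term) {
        StringBuilder out = new StringBuilder(term.length() + 4);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e4': out.append("ae"); break;  // a-umlaut -> ae
                case '\u00f6': out.append("oe"); break;  // o-umlaut -> oe
                case '\u00fc': out.append("ue"); break;  // u-umlaut -> ue
                case '\u00df': out.append("ss"); break;  // sharp s  -> ss
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String query = "uber";       // lowercased company name
        String word = "\u00fcber";   // the German word with u-umlaut

        // Folding makes the two terms collide.
        System.out.println(foldToBaseVowel(query).equals(foldToBaseVowel(word)));   // true
        // Expanding keeps them apart ("uber" vs. "ueber").
        System.out.println(expandToDigraph(query).equals(expandToDigraph(word)));   // false
        // The sharp-s and "ss" spellings still match each other.
        System.out.println(expandToDigraph("stra\u00dfe").equals("strasse"));       // true
    }
}
```

   In an actual filter the same per-character logic would operate on the token's term buffer rather than on a String.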
   
   Using a charfilter is not the same, because a charfilter runs before the 
whole filter chain, so every filter in the chain already sees the normalized 
text. The new filter should be a drop-in replacement for 
GermanNormalizationFilter in any position in the filter chain.
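
   A toy pipeline illustrates why the position matters. Everything here is hypothetical and uses only the Java standard library: a whitespace tokenizer, a stop-word stage whose list contains the original spelling "für", and the ä->ae style expansion. Normalizing the raw text first (as a charfilter would) hides the original spelling from the stop-word stage, while normalizing as the last step of the chain does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class FilterOrderSketch {

    // Hypothetical stop-word list that uses the original umlaut spelling.
    static final Set<String> STOP_WORDS = Collections.singleton("f\u00fcr");

    // The umlaut expansion requested in the issue, on lowercased text.
    public static String expand(String s) {
        return s.replace("\u00e4", "ae")
                .replace("\u00f6", "oe")
                .replace("\u00fc", "ue")
                .replace("\u00df", "ss");
    }

    // Minimal whitespace tokenizer.
    static List<String> tokenize(String text) {
        return new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));
    }

    // Stand-in for an earlier token filter that depends on the original spelling.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Charfilter style: the raw text is normalized before tokenization.
    public static List<String> normalizeBeforeChain(String text) {
        return removeStopWords(tokenize(expand(text)));
    }

    // Token-filter style: normalization is the last step of the chain.
    public static List<String> normalizeInsideChain(String text) {
        List<String> out = new ArrayList<>();
        for (String t : removeStopWords(tokenize(text))) {
            out.add(expand(t));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "tee f\u00fcr zwei";
        System.out.println(normalizeBeforeChain(text)); // [tee, fuer, zwei]
        System.out.println(normalizeInsideChain(text)); // [tee, zwei]
    }
}
```

   In the first variant the stop word survives because it was already rewritten to "fuer" before the stop-word stage ran; in the second it is removed as intended.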


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


