On Thu, 2007-09-27 at 13:26 -0400, J.J. Larrea wrote:
> At 12:13 PM -0400 9/27/07, Steven Rowe wrote:
> >Chris Hostetter wrote: ...
> As for implementation, the first part could easily and flexibly accomplished
> with the current PatternReplaceFilter, and I'm thinking the second could be
> done with an extension to that or better yet a new Filter which allows
> parsing synonymous tokens from a flat to overlaid format, e.g. something on
> the order of:
>
> <filter class="solr.PatternReplaceFilterFactory"
>         pattern="(.*)(ü|ue)(.*)"
>         replacement="$1ue$3|$1u$3"
>         tokensep="|" <!-- not currently implemented -->
>         replace="first"/>
>
> or perhaps better,
>
> <filter class="solr.PatternReplaceFilterFactory"
>         pattern="(.*)(ü|ue)(.*)"
>         replacement="$1ue$3|$1u$3"
>         replace="first"/>
> <filter class="solr.OverlayTokenFilterFactory"
>         tokensep="|"/> <!-- not currently implemented -->
>
> which in my fantasy implementation would map:
>
> Müller -> Mueller|Muller
> Mueller -> Mueller|Muller
> Muller -> Muller
>
> and could be run at index-time and/or query-time as appropriate.
>
> >Does anyone know if there are other (Latin-1-utilizing) languages
> >besides German with standardized diacritic substitutions that involve
> >something other than just stripping the diacritics?
>
> I'm curious about this too.
>
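The mapping quoted above can be tried out without Solr. What follows is only a rough sketch in Python, assuming the standard "re" module as a stand-in for PatternReplaceFilter with replace="first". The names fold_umlaut and overlay_tokens are made up for illustration, and the split on "|" merely imitates what the not-yet-implemented tokensep / OverlayTokenFilter idea might do:

```python
import re

# Same pattern and replacement as in the quoted config, in Python syntax.
PATTERN = re.compile(r"(.*)(ü|ue)(.*)")
REPLACEMENT = r"\1ue\3|\1u\3"


def fold_umlaut(token):
    """Rewrite only the first match, like replace="first" in the config."""
    return PATTERN.sub(REPLACEMENT, token, count=1)


def overlay_tokens(token, sep="|"):
    """Hypothetical stand-in for the overlay step: split the flat form
    into the variants that would share one token position."""
    return fold_umlaut(token).split(sep)


if __name__ == "__main__":
    for word in ("Müller", "Mueller", "Muller"):
        print(word, "->", overlay_tokens(word))
```

Because the leading (.*) is greedy, a token that contains several ü/ue occurrences would have only its last one rewritten, so this is at best a starting point.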
I am German but work in Spain, so I have not faced this problem so far.

Anyhow, IMO

Müller -> Mueller
Mueller -> Mueller

is right. Shortening the word further (to Muller) does not seem right,
since that changes the meaning too much.

Further:

groß -> gross
gross -> gross

ß is pronounced 'sz' but is only replaced by 'ss'.

salu2

> - J.J.
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java consulting, training and solutions