On Thu, 2007-09-27 at 13:26 -0400, J.J. Larrea wrote:
> At 12:13 PM -0400 9/27/07, Steven Rowe wrote:
> >Chris Hostetter wrote:
...
> As for implementation, the first part could easily and flexibly be
> accomplished with the current PatternReplaceFilter, and I'm thinking the
> second could be done with an extension to that or, better yet, a new Filter
> which allows parsing synonymous tokens from a flat to an overlaid format,
> e.g. something on the order of:
> 
>     <!-- tokensep is not currently implemented -->
>     <filter class="solr.PatternReplaceFilterFactory"
>      pattern="(.*)(ü|ue)(.*)"
>      replacement="$1ue$3|$1u$3"
>      tokensep="|"
>      replace="first"/>
> 
> or perhaps better,
> 
>     <filter class="solr.PatternReplaceFilterFactory"
>      pattern="(.*)(ü|ue)(.*)"
>      replacement="$1ue$3|$1u$3"
>      replace="first"/>
>     <filter class="solr.OverlayTokenFilterFactory"
>      tokensep="|"/>   <!-- not currently implemented -->
> 
> which in my fantasy implementation would map:
> 
>     Müller -> Mueller|Muller
>     Mueller -> Mueller|Muller
>     Muller -> Muller
> 
> and could be run at index-time and/or query-time as appropriate.
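
To make the quoted mapping concrete, here is a rough sketch in plain Java
of what the two steps would do to a single token. It is not Solr code: the
class OverlaySketch and the method expand are hypothetical names, and the
"|" separator stands in for the tokensep attribute that does not exist yet.
Only java.util.regex is used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class OverlaySketch {
    // Pattern and replacement copied from the quoted config; \u00fc is u-umlaut.
    private static final Pattern UMLAUT = Pattern.compile("(.*)(\u00fc|ue)(.*)");
    private static final String REPLACEMENT = "$1ue$3|$1u$3";
    // Assumed separator, mirroring the not-yet-implemented tokensep attribute.
    private static final String TOKEN_SEP = "|";

    // Hypothetical helper: rewrite the token to its flat "a|b" form, then
    // split it into the variants an overlay filter would stack at one position.
    static List<String> expand(String token) {
        String flat = UMLAUT.matcher(token).replaceFirst(REPLACEMENT);
        return Arrays.asList(flat.split(Pattern.quote(TOKEN_SEP)));
    }

    public static void main(String[] args) {
        for (String t : new String[] {"M\u00fcller", "Mueller", "Muller"}) {
            System.out.println(t + " -> " + expand(t));
        }
    }
}
```

A real filter would also have to emit the second variant with a position
increment of 0 so that both forms overlay the same position; the sketch
leaves that part out.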
> 
> >Does anyone know if there are other (Latin-1-utilizing) languages
> >besides German with standardized diacritic substitutions that involve
> >something other than just stripping the diacritics?
> 
> I'm curious about this too.
> 

I am German, but I work in Spain, so I have not faced the problem so
far. Anyhow, IMO
Müller -> Mueller
Mueller -> Mueller

is right. Shortening the word further does not seem right, since that
changes the meaning too much.

Further:
groß -> gross
gross -> gross

ß is called 'sz' (Eszett), but it is only ever replaced by 'ss'.
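
As a rough sketch of the single-form mapping above (ü -> ue, ß -> ss,
nothing shortened further), something like the following would do. The
class GermanFoldSketch and the method fold are hypothetical names, not part
of Solr or Lucene, and only the two lowercase characters mentioned here are
handled.

```java
final class GermanFoldSketch {
    // Hypothetical helper: replace u-umlaut (\u00fc) with "ue" and
    // sharp s (\u00df) with "ss"; every other character is left untouched.
    static String fold(String token) {
        return token.replace("\u00fc", "ue").replace("\u00df", "ss");
    }

    public static void main(String[] args) {
        String[] samples = {"M\u00fcller", "Mueller", "gro\u00df", "gross"};
        for (String t : samples) {
            System.out.println(t + " -> " + fold(t));
        }
    }
}
```

Applied at both index time and query time, this makes Müller and Mueller
meet on the same term, while Muller stays a different word.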

salu2

> - J.J.
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions
