On Wed, Feb 19, 2014 at 10:33 AM, Thomas Fischer <fischer...@aon.at> wrote:

>
> > Hmm, for standardization of text fields, collation might be a little
> > awkward.
>
> I arrived there after using custom rules for a while (see
> "RuleBasedCollator" on http://wiki.apache.org/solr/UnicodeCollation) and
> then being told
> "For better performance, less memory usage, and support for more locales,
> you can add the analysis-extras contrib and use
> ICUCollationKeyFilterFactory instead." (on the same page under "ICU
> Collation").
>
> > For your german umlauts, what do you mean by standardize? is this to
> > achieve equivalency of e.g. oe to ö in your search terms?
>
> That is the main point, but I might also need the additional normalization
> of combined characters like
> o + U+0308 (combining diaeresis) = ö, and probably similar constructions
> for other languages (like Hungarian).
>

Sure, but using collation just to get normalization is overkill too. Maybe
try ICUNormalizer2Filter? It gives you finer control over the normalization
anyway.
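For what it's worth, wiring that filter into a field type looks roughly like
the sketch below. The field type name and tokenizer are illustrative, and it
assumes the analysis-extras contrib jars are on the classpath:

```xml
<!-- Sketch only: field type name and tokenizer choice are illustrative. -->
<fieldType name="text_normalized" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- nfkc_cf = NFKC plus case folding; name="nfc" gives plain
         composition (o + U+0308 -> ö) without the extra folding -->
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
  </analyzer>
</fieldType>
```

Picking the normalization form (nfc, nfkc, nfkc_cf) is exactly the control
you don't get when you lean on collation for this.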


>
> > In that case, a simpler approach would be to put
> > GermanNormalizationFilterFactory in your chain:
> >
> http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
>
> I'll see how far I get with this, but from the description
>         • 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
>         • 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
> this seems to be too far-reaching a reduction: while the identification
> "ä=ae" is not very serious and rarely misleading, "ä=a" might pack together
> words that shouldn't be conflated; "Äsen" and "Asen" are quite different
> concepts.
>

I'm not sure that's a mainstream opinion: not only do the default German
collation rules conflate these two characters as equivalent at the primary
level, but so do many German stemming algorithms. Similar arguments could
be made for 'résumé' versus 'resume' and so on. Search isn't an exact
science.
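If it helps to see what primary-level equivalence roughly means, here's a
plain-Python approximation using stdlib Unicode decomposition (a sketch of
the idea, not what ICU collation actually does internally):

```python
import unicodedata

def fold_to_primary(s: str) -> str:
    """Rough primary-level folding: decompose to NFD, drop combining
    marks, then lowercase. Approximates (does not reproduce) how German
    collation conflates 'ä' with 'a' at primary strength."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

# 'Äsen' and 'Asen' compare equal at this level, as do the two résumés:
assert fold_to_primary("Äsen") == fold_to_primary("Asen")
assert fold_to_primary("résumé") == fold_to_primary("resume")
```

Note this folds ä to a, not to ae; the ae/a conflation is a separate rule
that GermanNormalizationFilter adds on top.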
