On Mar 5, 2013, at 3:50 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> : I get that it would fold an accented character into the non-accented
> : character, that's a prime reason why I use it ... but it's taking the accent
> : as a standalone character (like ` and ¨) and just getting rid of it
> : entirely.
> : That seems a little odd.
>
> Isn't that part of the point though? to normalize diacritics that might
> come as independent characters due to keyboard/charset limitations of the end
> user?
>
> So "tréma", "tre'ma" and "trema" all get normalized the same way?

The source of this folding is DiacriticFolding.txt, a Lucene-maintained file:

<https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_1_0/lucene/analysis/icu/src/data/utr30/DiacriticFolding.txt>

The syntax is:

    input-code-point-or-range ">" output-code-point-sequence (optional)

The first two rules are to drop carets (005E) "^" and backticks (0060) "`".

However, if we accept the idea that these characters, which you assert to be
dual-purpose (i.e., both standalone and combining characters), should be
stripped by an accent-stripping component, then for consistency I think at
least two other ASCII chars should be included: single-quote "'" and tilde "~".
But they're not.

So, as expected, this line fails when I add it to TestICUFoldingFilter:

    assertAnalyzesTo(a, "`something", new String[] { "`something" });

But these two succeed:

    assertAnalyzesTo(a, "tre'ma", new String[] { "tre'ma" });
    assertAnalyzesTo(a, "pen~a", new String[] { "pen~a" });

I'm not sure if backticks should be folded away by ICUFoldingFilter, but if
they are, then I think we should be consistent and also fold away other
characters like them.

Steve
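
P.S. To make the rule syntax concrete, here is a rough, JDK-only sketch of how rules of that shape could be read and applied. It is not how ICUFoldingFilter actually consumes the file. The class and method names (FoldingRuleSketch, parse, fold) are made up for illustration, the ".." range separator is my assumption about the file format, and only the two rules mentioned above (005E and 0060) are included.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not part of Lucene or ICU.
final class FoldingRuleSketch {

    // The two rules discussed above: drop caret (005E) and backtick (0060).
    static final String[] SAMPLE_RULES = { "005E>", "0060>" };

    // Parses rules of the form: input-code-point-or-range ">" output-code-point-sequence (optional).
    // Assumption: a range is written START..END, and code points are hexadecimal.
    static Map<Integer, String> parse(String[] rules) {
        Map<Integer, String> mapping = new HashMap<>();
        for (String rule : rules) {
            String line = rule.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comment lines
            }
            int gt = line.indexOf('>');
            if (gt < 0) {
                throw new IllegalArgumentException("no '>' in rule: " + line);
            }
            String input = line.substring(0, gt).trim();
            String output = line.substring(gt + 1).trim();

            // An empty output sequence means "delete the input character".
            StringBuilder replacement = new StringBuilder();
            if (!output.isEmpty()) {
                for (String cp : output.split("\\s+")) {
                    replacement.appendCodePoint(Integer.parseInt(cp, 16));
                }
            }

            int dots = input.indexOf("..");
            int start = Integer.parseInt(dots < 0 ? input : input.substring(0, dots), 16);
            int end = dots < 0 ? start : Integer.parseInt(input.substring(dots + 2), 16);
            for (int cp = start; cp <= end; cp++) {
                mapping.put(cp, replacement.toString());
            }
        }
        return mapping;
    }

    // Applies the mapping code point by code point; unmapped characters pass through.
    static String fold(String text, Map<Integer, String> mapping) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String replacement = mapping.get(cp);
            if (replacement == null) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> mapping = parse(SAMPLE_RULES);
        System.out.println(fold("`something", mapping)); // backtick is dropped
        System.out.println(fold("tre'ma", mapping));     // single-quote has no rule
        System.out.println(fold("pen~a", mapping));      // tilde has no rule
    }
}
```

With only those two rules, the backtick is removed while the single-quote and tilde are left alone, which is the inconsistency I'm pointing at.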