On Mar 5, 2013, at 3:50 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> : I get that it would fold an accented character into the non-accented
> : character, that's a prime reason why I use it ... but it's taking the accent
> : as a standalone character (like ` and ¨) and just getting rid of it
> : entirely.
> : That seems a little odd.
> 
> Isn't that part of the point though? to normalize diacritics that might 
> come as independent characters due to keyboard/charset limitations of the end 
> user?
> 
> So "tréma", "tre'ma" and "trema" all get normalized the same way?

The source of this folding is DiacriticFolding.txt, a Lucene-maintained file: 
<https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_1_0/lucene/analysis/icu/src/data/utr30/DiacriticFolding.txt>

The syntax is: input-code-point-or-range ">" output-code-point-sequence,
where the output sequence is optional; a rule with no output drops the input.

The first two rules drop the caret "^" (U+005E) and the backtick "`" (U+0060).
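
To make the rule syntax concrete, here is a minimal sketch of a parser for one 
rule line.  It is purely illustrative: FoldingRuleSketch is a hypothetical 
class, not anything in Lucene or ICU, and it assumes that ranges are written 
as "start..end", that code points are hexadecimal, and that "#" starts a 
comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for one folding rule line; not part of Lucene or ICU. */
final class FoldingRuleSketch {
    final int start;            // first input code point
    final int end;              // last input code point (equals start for a single code point)
    final List<Integer> output; // replacement code points; empty means "drop the input"

    private FoldingRuleSketch(int start, int end, List<Integer> output) {
        this.start = start;
        this.end = end;
        this.output = output;
    }

    static FoldingRuleSketch parse(String line) {
        // Assumption: everything after '#' is a comment.
        int hash = line.indexOf('#');
        String rule = (hash >= 0 ? line.substring(0, hash) : line).trim();

        int gt = rule.indexOf('>');
        if (gt < 0) {
            throw new IllegalArgumentException("no '>' in rule: " + line);
        }
        String input = rule.substring(0, gt).trim();
        String out = rule.substring(gt + 1).trim();

        // Assumption: a range is written as "start..end" in hex.
        int start;
        int end;
        int dots = input.indexOf("..");
        if (dots >= 0) {
            start = Integer.parseInt(input.substring(0, dots), 16);
            end = Integer.parseInt(input.substring(dots + 2), 16);
        } else {
            start = Integer.parseInt(input, 16);
            end = start;
        }

        // The output sequence is optional; an empty one deletes the input.
        List<Integer> output = new ArrayList<>();
        if (!out.isEmpty()) {
            for (String cp : out.split("\\s+")) {
                output.add(Integer.parseInt(cp, 16));
            }
        }
        return new FoldingRuleSketch(start, end, output);
    }

    /** True if this rule removes the given code point entirely. */
    boolean drops(int codePoint) {
        return codePoint >= start && codePoint <= end && output.isEmpty();
    }
}
```

With this sketch, FoldingRuleSketch.parse("005E>").drops('^') and 
FoldingRuleSketch.parse("0060>").drops('`') are both true, which matches the 
two rules described above.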

However, if the idea is that these dual-purpose characters (i.e., both 
standalone and combining characters, as you assert) should be stripped by an 
accent-stripping component, then for consistency I think at least two other 
ASCII chars should be included: single-quote "'" and tilde "~".  But they're 
not.

So, as expected, this line fails when I add it to TestICUFoldingFilter, 
because the backtick is stripped from the output:

    assertAnalyzesTo(a, "`something", new String[] { "`something" });

But these two succeed, because the single-quote and tilde pass through 
unchanged:

    assertAnalyzesTo(a, "tre'ma", new String[] { "tre'ma" });
    assertAnalyzesTo(a, "pen~a", new String[] { "pen~a" });
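
The asymmetry in those assertions can be modeled in plain Java.  This is only 
a rough sketch of the drop behavior, not ICUFoldingFilter itself (it ignores 
case folding and every other rule); the class name StandaloneAccentModel and 
the EXTENDED set are hypothetical, the latter standing in for the "consistent" 
alternative.

```java
import java.util.Set;

/** Hypothetical model of dropping standalone accent characters; not Lucene code. */
final class StandaloneAccentModel {
    // U+005E '^' and U+0060 '`': the two ASCII characters dropped today.
    static final Set<Integer> CURRENT = Set.of(0x005E, 0x0060);

    // Hypothetical extension: also drop U+0027 "'" and U+007E '~'.
    static final Set<Integer> EXTENDED = Set.of(0x005E, 0x0060, 0x0027, 0x007E);

    /** Returns the input with every code point in 'dropped' removed. */
    static String fold(String input, Set<Integer> dropped) {
        StringBuilder sb = new StringBuilder();
        input.codePoints()
             .filter(cp -> !dropped.contains(cp))
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

With CURRENT, fold turns "`something" into "something" but leaves "tre'ma" and 
"pen~a" alone; with EXTENDED, those become "trema" and "pena".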

I'm not sure if backticks should be folded away by ICUFoldingFilter, but if 
they are, then I think we should be consistent and also fold away other 
characters like them.

Steve
