Among several other foldings, ICUFoldingFilter performs the Unicode NFC 
transform, which consists of canonical decomposition (NFD) followed by 
canonical composition.  NFD transforms U+FA04 to U+5B85, and canonical 
composition leaves U+5B85 as-is.
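
For illustration, here is a minimal sketch of the normalization step alone. It uses Python's standard unicodedata module rather than ICU or Lucene, so it says nothing about ICUFoldingFilter's other foldings, and the variable names are only for this example:

```python
import unicodedata

compat = "\uFA04"   # CJK COMPATIBILITY IDEOGRAPH-FA04
unified = "\u5B85"  # CJK UNIFIED IDEOGRAPH-5B85

# Canonical decomposition replaces the compatibility ideograph.
nfd = unicodedata.normalize("NFD", compat)
# NFC = NFD followed by canonical composition; the result is not recomposed.
nfc = unicodedata.normalize("NFC", compat)

print(f"NFD: U+{ord(nfd):04X}")  # NFD: U+5B85
print(f"NFC: U+{ord(nfc):04X}")  # NFC: U+5B85
```

Once normalized, the original U+FA04 cannot be recovered from the indexed term.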

U+FA04 is in the “Pronunciation variants from KS X 1001:1998” sub-block - KS X 
1001 is a Korean encoding standard - in the "CJK Compatibility Ideographs" 
block <http://www.unicode.org/charts/PDF/UF900.pdf>.  I don’t know why these 
variants were included in Unicode, but the NFD transform includes the 
compatibility->canonical transform, so it’s likely many other compatibility 
characters in your data will be affected, not just this one.  If the 
compatibility->canonical transform is problematic, why are you using 
ICUFoldingFilter?
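
To get a rough idea of how many characters in that block are affected, something like the following could be run. Again, this is only a sketch using Python's unicodedata module (whose Unicode data version may differ from ICU's), and changed_by_nfc is a hypothetical helper name, not part of any library:

```python
import unicodedata

def changed_by_nfc(first=0xF900, last=0xFAFF):
    """Hypothetical helper: map each code point in [first, last] that NFC
    rewrites to the string it is rewritten to."""
    changed = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        normalized = unicodedata.normalize("NFC", ch)
        if normalized != ch:
            changed[cp] = normalized
    return changed

# Default range is the "CJK Compatibility Ideographs" block.
affected = changed_by_nfc()
print(len(affected), "code points in U+F900..U+FAFF change under NFC")
for cp in (0xFA04, 0xF900):
    print(f"U+{cp:04X} -> U+{ord(affected[cp]):04X}")
```

Checking your own data against such a mapping would show which of these characters actually occur in it.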

If you like some of the foldings included in ICUFoldingFilter but not others, 
check out the “gennorm2” and “gen-utr30-data-files” targets in the Lucene/Solr 
source code at lucene/analysis/icu/build.xml.  You could build and use a 
modified binary transform data file; the stock file is distributed as part of 
the lucene-analyzers-icu jar at org/apache/lucene/analysis/icu/utr30.nrm.
 
--
Steve
www.lucidworks.com

> On Oct 30, 2016, at 10:29 AM, Ahmet Arslan <iori...@yahoo.com.INVALID> wrote:
> 
> Hi Eyal,
> 
> ICUFoldingFilter uses http://site.icu-project.org under the hood.
> If you think there is a bug, it is better to ask its mailing list.
> 
> Ahmet
> 
> 
> 
> On Sunday, October 30, 2016 3:41 PM, "eyal.naam...@exlibrisgroup.com" 
> <eyal.naam...@exlibrisgroup.com> wrote:
> Hi,
> 
> I was wondering if anyone ran into the following issue, or a similar one:
> In Han script there are two separate characters - 宅 (FA04) and 宅 (5B85).
> It seems that ICUFoldingFilter converts FA04 to 5B85, which results in the 
> wrong character being indexed.
> Does anyone have any idea if and how this can be resolved? Is there an option 
> to add an exception rule to ICUFoldingFilter?
> Thanks,
> Eyal
