Among several other foldings, ICUFoldingFilter performs the Unicode NFC transform, which consists of canonical decomposition (NFD) followed by canonical composition. NFD transforms U+FA04 to U+5B85, and canonical composition leaves U+5B85 as-is, because characters with a singleton decomposition like U+FA04 are excluded from recomposition.
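
To see the mapping outside of Lucene, here is a minimal sketch that uses the JDK's java.text.Normalizer rather than ICUFoldingFilter itself. The class name NfcDemo and its helper methods are made up for illustration. It only shows the canonical normalization step, not the other foldings the filter applies.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

// Illustrative sketch only: plain JDK normalization, not ICUFoldingFilter.
// NfcDemo, nfc, nfd and codePoints are hypothetical names for this example.
class NfcDemo {

    // Canonical decomposition followed by canonical composition.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition only.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Render each code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String compat = "\uFA04"; // CJK compatibility ideograph from the question
        System.out.println("input: " + codePoints(compat));
        System.out.println("NFD:   " + codePoints(nfd(compat)));
        System.out.println("NFC:   " + codePoints(nfc(compat)));
    }
}
```

Running the same check over a sample of your own data is a quick way to find out which other characters change under NFC before you decide whether to keep the filter.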

U+FA04 is in the "Pronunciation variants from KS X 1001:1998" sub-block (KS X 1001 is a Korean encoding standard) of the "CJK Compatibility Ideographs" block <http://www.unicode.org/charts/PDF/UF900.pdf>. I don’t know why these variants were included in Unicode, but the NFD transform includes the compatibility->canonical transform, so it’s likely many other compatibility characters in your data will be affected, not just this one.

If the compatibility->canonical transform is problematic, why are you using ICUFoldingFilter? If you like some of the foldings included in ICUFoldingFilter but not others, check out the "gennorm2" and "gen-utr30-data-files" targets in the Lucene/Solr source code at lucene/analysis/icu/build.xml. You could build and use a modified binary transform data file; this file is distributed as part of the lucene-analyzers-icu jar at org/apache/lucene/analysis/icu/utr30.nrm.

--
Steve
www.lucidworks.com

> On Oct 30, 2016, at 10:29 AM, Ahmet Arslan <iori...@yahoo.com.INVALID> wrote:
>
> Hi Eyal,
>
> ICUFoldingFilter uses http://site.icu-project.org under the hood.
> If you think there is a bug, it is better to ask its mailing list.
>
> Ahmet
>
>
>
> On Sunday, October 30, 2016 3:41 PM, "eyal.naam...@exlibrisgroup.com"
> <eyal.naam...@exlibrisgroup.com> wrote:
> Hi,
>
> I was wondering if anyone ran into the following issue, or a similar one:
> In Han script there are two separate characters - 宅 (FA04) and 宅 (5B85).
> It seems that ICUFoldingFilter converts FA04 to 5B85, which results in the
> wrong character being indexed.
> Does anyone have any idea if and how this can be resolved? Is there an option
> to add an exception rule to ICUFoldingFilter?
> Thanks,
> Eyal