Hi,
I found a strange behavior with the MappingCharFilterFactory in Solr *6.2.1*. I'm curious whether I'm missing something or whether someone else has run into this.

I have a (index and query) chain composed as follows:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.KeywordTokenizerFactory" />
...

The mapping-FoldToASCII.txt is the exact file that ships in the Solr download bundle; I didn't add any mappings. I started having some search issues and, after checking, I saw that some characters with diacritics weren't being replaced. I isolated one of those cases and tried to see what happens in the analysis page.

As I suspected, the characters weren't replaced, so I tried character by character. Still nothing: it doesn't work.
An example:

I pasted īà in the "Field Value (Index)" box. The *ī* char is the Unicode character *\u012b*, which is already mapped in mapping-FoldToASCII.txt.
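
One thing worth ruling out first is whether the pasted text really contains the precomposed characters and not a base letter followed by a combining mark, since a rule written for *\u012b* would generally not match the decomposed form. A minimal Python sketch of that check follows; the `describe` helper is a hypothetical name, and the input is written with escapes so that nothing depends on how the text was pasted:

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# The two characters from the example, written as escapes.
pasted = "\u012b\u00e0"

for code_point, name in describe(pasted):
    print(code_point, name)

# If NFC normalization leaves the string unchanged, it is already in
# precomposed form (no separate combining marks).
is_precomposed = unicodedata.normalize("NFC", pasted) == pasted
print("precomposed (NFC):", is_precomposed)
```

If the real input reports a combining mark (for example U+0304) after a plain "i", the problem would be in the input and not in the char filter.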

Without the "Verbose Output" flag [1]

 * I see an empty space beside the MCF (where instead I'd expect to see
   the "i", "a" replaced characters)
 * the KeywordTokenizer reports exactly my input "īà" so it seems the
   MCF didn't make any change to the source input

However, if I turn the "Verbose Output" flag on [2]

 * You can see that the MCF is working (i.e. ī becomes i, and à becomes a)
 * But the KeywordTokenizer is still ignoring that and it produces īà
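
To make the expected result concrete, here is a rough Python sketch of what the char filter should hand to the tokenizer for this input. It is not Solr's implementation: `parse_rules` and `apply_rules` are hypothetical helpers, the two rules are written in the `"source" => "target"` format of the mapping file, and only single-character sources are handled (the real filter also supports multi-character sources).

```python
import re

# Two rules in the mapping-file format, kept as raw text so the \u escapes
# are decoded by parse_rules and not by Python itself.
RULES_TEXT = r'''
# LATIN SMALL LETTER I WITH MACRON
"\u012B" => "i"
# LATIN SMALL LETTER A WITH GRAVE
"\u00E0" => "a"
'''

RULE = re.compile(r'^"(.*)"\s*=>\s*"(.*)"$')

def parse_rules(text):
    """Parse '"source" => "target"' lines into a dict, skipping comments."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RULE.match(line)
        if match:
            # Decode \uXXXX escapes on both sides of the rule.
            src, dst = (
                part.encode("ascii").decode("unicode_escape")
                for part in match.groups()
            )
            rules[src] = dst
    return rules

def apply_rules(text, rules):
    """Replace each character that has a rule; leave the others untouched."""
    return "".join(rules.get(ch, ch) for ch in text)

rules = parse_rules(RULES_TEXT)
print(apply_rules("\u012b\u00e0", rules))
```

With these two rules the output is "ia", which is what I would expect the KeywordTokenizer to receive and emit as its single token.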

I tried the same with a Solr 4.7.1 instance and, as you can see [3], it works as I would expect.

Any help would be warmly appreciated

Best,
Andrea

[1] https://drive.google.com/file/d/0B82QaJKoMzvWb3dLcW80ME5wdXc/view
[2] https://drive.google.com/file/d/0B82QaJKoMzvWN2lNSF9JQUhPZ3c/view
[3] https://drive.google.com/file/d/0B82QaJKoMzvWeHRzUnU3MGFtY2s/view
