Hi,
I found a strange behavior with the MappingCharFilterFactory in Solr
*6.2.1*. I'm curious whether I'm missing something or whether someone
else has run into this.
I have a (index and query) chain composed as follows:
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.KeywordTokenizerFactory" />
...
The mapping-FoldToASCII.txt is the exact file that you can find in the
Solr download bundle; I didn't add any mappings.
I started having some search issues and, after checking, I saw that some
characters with diacritics weren't being replaced. I isolated one of those
cases and tried to see what happens in the analysis page.
As expected, the characters weren't replaced, so I tried character by
character. Still nothing, it doesn't work.
An example:
I pasted īà in the "Field Value (Index)" box. The *ī* char is the
Unicode character *\u012b*, which is already mapped in mapping-FoldToASCII.txt.
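To make the expectation concrete, here is a minimal Python sketch of the folding I'd expect for this input. It is not Solr's implementation: the two sample rules are only written in the same `"source" => "target"` format as mapping-FoldToASCII.txt, the names `parse_rules` and `fold` are hypothetical, and the sketch handles single-character sources only.

```python
import re

# Two sample rules in the "source" => "target" format of the mapping file
# (assumed entries for illustration, not read from the real file).
SAMPLE_MAPPING = r'''
# LATIN SMALL LETTER I WITH MACRON
"\u012B" => "i"
# LATIN SMALL LETTER A WITH GRAVE
"\u00E0" => "a"
'''

RULE = re.compile(r'^"(.*)"\s*=>\s*"(.*)"$')
ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})')


def unescape(text):
    # Turn literal \uXXXX sequences into the characters they denote.
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def parse_rules(text):
    # Hypothetical helper: collect source -> target pairs, skipping
    # blank lines and comments.
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RULE.match(line)
        if match:
            rules[unescape(match.group(1))] = unescape(match.group(2))
    return rules


def fold(text, rules):
    # Simplification: per-character lookup only, no multi-character sources.
    return "".join(rules.get(ch, ch) for ch in text)


rules = parse_rules(SAMPLE_MAPPING)
print(fold("\u012b\u00e0", rules))  # prints "ia"
```

With a KeywordTokenizer after the char filter, I'd expect the single output token to be that folded string.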
Without the "Verbose Output" flag [1]:
* I see an empty space beside the MCF (where instead I'd expect to see
the "i", "a" replaced characters)
* the KeywordTokenizer reports exactly my input "īà" so it seems the
MCF didn't make any change to the source input
However, if I turn the "Verbose Output" flag on [2]:
* You can see that the MCF is working (i.e. ī becomes i, and à becomes a)
* But the KeywordTokenizer is still ignoring that and it produces īà
I tried the same with a Solr 4.7.1 instance and, as you can see [3], it
works as I would expect.
Any help would be warmly appreciated.
Best,
Andrea
[1] https://drive.google.com/file/d/0B82QaJKoMzvWb3dLcW80ME5wdXc/view
[2] https://drive.google.com/file/d/0B82QaJKoMzvWN2lNSF9JQUhPZ3c/view
[3] https://drive.google.com/file/d/0B82QaJKoMzvWeHRzUnU3MGFtY2s/view