The problem is that even though the Unicode code points \uFFFF and \uFFFE
can be encoded in UTF-8, they are noncharacters, and standards-conforming
XML parsers will reject them. The Unicode replacement character, \uFFFD,
can be used in place of such characters. While indexing docs, replace all
such characters with \uFFFD; Solr handles the result well.
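
A minimal sketch of that replacement step, in Python, assuming the XML 1.0
"Char" production is what the parser enforces. The helper name
sanitize_for_xml is made up for illustration; it is not part of Solr or of
any client library.

```python
import re

# Everything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# This matches U+FFFE and U+FFFF, most C0 control characters, and lone
# surrogates, but not supplementary noncharacters such as U+9FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

REPLACEMENT_CHAR = "\uFFFD"  # Unicode replacement character


def sanitize_for_xml(text: str) -> str:
    """Hypothetical helper: replace characters XML 1.0 disallows with U+FFFD."""
    return _INVALID_XML_CHARS.sub(REPLACEMENT_CHAR, text)
```

Run each field value through something like this before building the XML
update message. It replaces rather than drops the offending characters, so
string lengths and offsets stay the same.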

-Shankar





On 8/5/13 12:09 PM, "Robert Muir" <rcm...@gmail.com> wrote:

>On Mon, Aug 5, 2013 at 3:03 PM, Chris Hostetter
><hossman_luc...@fucit.org> wrote:
>>
>> : > 0xfffe is not a special character -- it is explicitly *not* a
>> : > character in Unicode at all, it is set aside as "not a character,"
>> : > specifically so that the character 0xfeff can be used as a BOM, and
>> : > if the BOM is read incorrectly, it will cause an error.
>> :
>> : XML doesn't allow control characters like this, it defines character as:
>>
>> But is that even relevant?  I thought FFFE was *not* a control character?
>> I thought it was completely invalid in Unicode.
>>
>
>It's totally relevant. FFFE is a Unicode code point, but it's a noncharacter.
>
>It's just that XML disallows the FFFE and FFFF noncharacters, but allows
>other noncharacters (like 9FFFF).
>These are "allowed but discouraged": http://www.w3.org/TR/xml11/#charsets
>

