The problem is that even though the code points \uFFFF and \uFFFE can be encoded in UTF-8, they are Unicode noncharacters, and standards-conforming XML parsers will not accept them. The Unicode replacement character \uFFFD can be used in their place. While indexing docs, replace all such characters with \uFFFD, and Solr handles the result well.
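A minimal sketch of that replacement step, in Python, assuming the documents are cleaned on the client side before they are sent to Solr as XML. The function name `sanitize_for_xml` is hypothetical, not part of Solr or any client library. The character class is the complement of the XML 1.0 `Char` production, so besides \uFFFE and \uFFFF it also replaces other disallowed code points such as most C0 control characters.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# Unicode replacement character
REPLACEMENT_CHAR = "\uFFFD"


def sanitize_for_xml(text: str) -> str:
    """Replace every code point that XML 1.0 disallows with U+FFFD.

    Hypothetical helper: intended to be applied to each field value
    before the document is serialized to XML for indexing.
    """
    return _INVALID_XML_CHARS.sub(REPLACEMENT_CHAR, text)
```

For example, `sanitize_for_xml("a\uFFFEb")` gives back the string with \uFFFD in place of \uFFFE, while tabs, newlines, and ordinary text pass through unchanged.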
-Shankar

On 8/5/13 12:09 PM, "Robert Muir" <rcm...@gmail.com> wrote:

>On Mon, Aug 5, 2013 at 3:03 PM, Chris Hostetter
><hossman_luc...@fucit.org> wrote:
>>
>> : > 0xfffe is not a special character -- it is explicitly *not* a character in
>> : > Unicode at all, it is set asside as "not a character." specifically so
>> : > that the character 0xfeff can be used as a BOM, and if the BOM is read
>> : > incorrectly, it will cause an error.
>> :
>> : XML doesnt allow control character like this, it defines character as:
>>
>> But is that even relevant? I thought FFFE was *not* a control character?
>> I thought it was completely invaid in Unicode.
>>
>
>its totally relevant. FFFE is a unicode codepoint, but its a noncharacter.
>
>Its just that XML disallows FFFE and FFFF noncharacters, but allows
>other noncharacters (like 9FFFF)
>These are "allowed but discouraged": http://www.w3.org/TR/xml11/#charsets
>