On Sat, Mar 1, 2008 at 4:22 PM, Brian Whitman <[EMAIL PROTECTED]> wrote:
> Once in a while we get this
>
> javax.xml.stream.XMLStreamException: ParseError at [row,col]:[4,790470]
> [14:32:21.877] Message: An invalid XML character (Unicode: 0x6) was [...]
> Our data comes from all sorts of places and although we've tried to be
> utf8 wherever we can, there are still cracks.

The issue is that, unfortunately, XML cannot represent all of Unicode (it
prohibits some values, such as most control characters below 0x20). This
holds even when they are escaped as character references, so &#6; will
cause the XML parser to throw an exception.

$ echo '<foo>&#6;</foo>' | xmllint -
-:1: parser error : xmlParseCharRef: invalid xmlChar value 6
<foo>&#6;</foo>

> I would much rather a document get added with replacement character
> than to have this error prevent the addition of 8K documents (as has
> happened here, this one character was in a 8K <add><doc>..<doc... run,
> and only the ones before this character were added.)
>
> Is there something I can do on the solr side to ignore/replace invalid
> characters?

Since the error comes from the XML parser, not really. If your documents
are basic (no index-time boost, fixed fields), you could try using CSV.
You could also scan for such chars on the client side before the XML is
produced.

-Yonik
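
For the client-side scan suggested above, something along these lines might work. It is only a sketch in Python, not anything Solr ships: the function name replace_invalid_xml_chars and the choice of U+FFFD as the replacement are assumptions made for illustration. The character class is the complement of the XML 1.0 "Char" production.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_invalid_xml_chars(text, replacement="\ufffd"):
    """Return text with every character XML 1.0 forbids replaced.

    Hypothetical helper: the name and the default replacement
    (U+FFFD REPLACEMENT CHARACTER) are illustrative choices.
    Pass replacement="" to drop the characters instead.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    # The 0x6 control character from the error above gets replaced;
    # tab, newline and ordinary text are left alone.
    cleaned = replace_invalid_xml_chars("title\x06 with\ttab")
    print(repr(cleaned))
```

This would need to run on the raw field values before they are XML-escaped and written into the <add><doc> body, so that one stray character no longer aborts the whole batch.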