On Sat, Mar 1, 2008 at 4:22 PM, Brian Whitman <[EMAIL PROTECTED]> wrote:
> Once in a while we get this
>
>  javax.xml.stream.XMLStreamException: ParseError at [row,col]:[4,790470]
>  [14:32:21.877] Message: An invalid XML character (Unicode: 0x6) was
[...]
>  Our data comes from all sorts of places and although we've tried to be
>  utf8 wherever we can, there are still cracks.

The issue is that, unfortunately, XML cannot represent all of Unicode:
XML 1.0 prohibits some code points, including most control characters
below 0x20.
This holds even if they are escaped, so &#6; will still cause the XML
parser to throw an exception.
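
For reference, here is a minimal Python sketch of which code points
XML 1.0 allows, following the "Char" production in the spec. The names
is_xml10_char and parses are made up for illustration, and the check
uses Python's bundled parser, not the one Solr uses.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD        # skips the surrogate block
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(xml_text: str) -> bool:
    """Report whether Python's expat-based parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_xml10_char(0x6))          # the character from the error
    print(parses("<foo>&#6;</foo>"))   # escaped form of the same character
    print(parses("<foo>&#9;</foo>"))   # an allowed control character (tab)
```

The point is only that escaping does not help: a character reference
to a prohibited code point is rejected just like the raw character.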

$ echo '<foo>&#6;</foo>' | xmllint -
-:1: parser error : xmlParseCharRef: invalid xmlChar value 6
<foo>&#6;</foo>


>  I would much rather a document get added with replacement character
>  than to have this error prevent the addition of 8K documents (as has
>  happened here, this one character was in a 8K <add><doc>..<doc... run,
>  and only the ones before this character were added.)
>
>  Is there something I can do on the solr side to ignore/replace invalid
>  characters?

Not really, since the error comes from the XML parser itself.

If your documents are basic (no index-time boost, fixed fields), you
could try using CSV.
You could also scan for such chars on the client side before the XML
is produced.
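
A client-side scan could look roughly like the Python sketch below. It
replaces every code point outside the XML 1.0 "Char" production with
U+FFFD before the text is escaped. The names sanitize and field_xml are
hypothetical helpers for illustration, not part of Solr or any client
library, and the field name "body" is just an example.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches anything outside the XML 1.0 'Char' production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize(text: str, replacement: str = "\uFFFD") -> str:
    """Replace characters that XML 1.0 cannot carry (assumed policy)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def field_xml(name: str, value: str) -> str:
    """Build one <field> element; sanitize first, then escape markup."""
    return "<field name=%s>%s</field>" % (
        quoteattr(name),
        escape(sanitize(value)),
    )


if __name__ == "__main__":
    print(field_xml("body", "bad\x06char & <markup>"))
```

Note that this has to run on the raw text. If the client has already
turned the character into a reference such as &#6;, the pattern above
will not see it.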

-Yonik
