Once in a while we get this error:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[4,790470]
[14:32:21.877] Message: An invalid XML character (Unicode: 0x6) was found in the element content of the document.
[14:32:21.877] at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
[14:32:21.877] at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:318)
[14:32:21.877] at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
...
Our data comes from all sorts of places, and although we've tried to be
UTF-8 wherever we can, there are still cracks.
I would much rather have a document added with a replacement character
than have this error prevent the addition of 8K documents (as happened
here: this one character was in an 8K <add><doc>..<doc... run, and only
the documents before it were added).
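To make the kind of replacement I mean concrete, here is a minimal sketch of doing it on the client before the XML is posted. It assumes the field values are available as Java strings at that point. The class and method names (XmlCharSanitizer, isValidXmlChar, sanitize) are made up for illustration and are not part of Solr. The check follows the Char production of the XML 1.0 spec, and anything outside it (such as the 0x6 above) becomes U+FFFD.

```java
/** Hypothetical helper, not part of Solr: replaces characters that XML 1.0 forbids. */
final class XmlCharSanitizer {

    private XmlCharSanitizer() {}

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every invalid code point replaced by U+FFFD. */
    static String sanitize(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt returns a lone surrogate as-is, so unpaired
            // surrogates fail the validity check and are replaced too.
            int cp = in.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : 0xFFFD);
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Each field value would go through XmlCharSanitizer.sanitize(value) before being XML-escaped and written into the <doc>. That only helps for data that passes through our own code, though.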
Is there something I can do on the Solr side to ignore or replace invalid
characters?