I'm indexing some mail archives and, across the various formats and
encodings, some messages contain invalid control characters.
I set the field like this:
doc.setField( "body", content.toString() );
In the solr logs, I get:
[java] SEVERE: java.io.IOException: Illegal character ((CTRL-CHAR, code 22))
[java]  at [row,col {unknown-source}]: [758496,50]
[java]     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:73)
[java]     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
[java]     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
Is there any standard way to escape invalid XML control characters?
If so, we should add that to XML.escapeCharData(), which gets called
from ClientUtils.writeXML().
It looks like the XML class already has something set for 22, so I'm
not sure what could be happening.
I have also tried:
StringBuilder body = new StringBuilder( content.toString() );
for( int i=0; i<body.length(); i++ ) {
  int c = body.charAt( i );
  // 9 = TAB, 10 = New Line, 13 = CR
  if( c < ' ' && c != 9 && c != 10 && c != 13 ) {
    log.warn( "Contains invalid character: '"+c+"' " );
    // replace control character with space
    body.setCharAt( i, ' ' );
  }
}
doc.setField( "body", body.toString() );
but that still gives the same error.
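For reference, the XML 1.0 spec's Char production only allows #x9, #xA,
#xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so a stricter
filter would also catch things like lone surrogates and #xFFFE/#xFFFF.
Below is a minimal sketch of that check. XmlCharFilter, isValidXmlChar
and replaceInvalid are names I made up for illustration, not anything in
Solr, and I have not confirmed that this makes the error above go away.

```java
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that XML 1.0 does not allow with the given char. */
    static String replaceInvalid(String in, char replacement) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // Walk by code point so surrogate pairs are treated as one character;
            // an unpaired surrogate comes back as itself and is rejected below.
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

It would be used as
doc.setField( "body", XmlCharFilter.replaceInvalid( content.toString(), ' ' ) );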
Any ideas?
thanks
ryan