I'm indexing some mail archives, and across the various formats, encodings, etc., some messages contain invalid control characters.

doc.setField( "body", content.toString() );

In the Solr logs, I get:

    [java] SEVERE: java.io.IOException: Illegal character ((CTRL-CHAR, code 22))
    [java]  at [row,col {unknown-source}]: [758496,50]
    [java]     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:73)
    [java]     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    [java]     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)

Is there any standard way to escape invalid XML control characters?

If so, we should add that to XML.escapeCharData() -- this gets called from ClientUtils.writeXML()

It looks like the XML class already has something set for 22, so I'm not sure what could be happening.

I have also tried:

    // copy into a mutable buffer so characters can be replaced in place
    StringBuilder body = new StringBuilder( content.toString() );
    for( int i=0; i<body.length(); i++ ) {
      int c = body.charAt( i );
      if( c < ' ' && c != 9 && c != 10 && c != 13 ) { // 9 = TAB, 10 = New Line, 13 = CR
        log.warn( "Contains invalid character: '"+c+"' " );
        // replace control character with space
        body.setCharAt( i, ' ' );
      }
    }
    doc.setField( "body", body.toString() );

but that still gives the same error.
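
For reference, the loop above only covers characters below 0x20. The XML 1.0 spec allows just #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so unpaired surrogates and 0xFFFE/0xFFFF are also illegal. A fuller filter might look like the sketch below. The names XmlCharFilter, isValidXmlChar and stripInvalidXmlChars are made up for illustration and are not part of Solr, and I have not confirmed that this is what is tripping the parser here.

```java
public class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: replaces every code point that XML 1.0
    // disallows with a single space, leaving everything else untouched.
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt returns the lone surrogate value for an unpaired
            // surrogate, which isValidXmlChar then rejects
            int cp = in.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Usage would be something like doc.setField( "body", XmlCharFilter.stripInvalidXmlChars( content.toString() ) ); applied to every field that can carry message text, not only "body".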

Any ideas?
thanks
ryan

