escaping/removing control characters?

Ryan McKinley Sat, 13 Dec 2008 10:46:05 -0800

I'm indexing some mail archives and within the various formats/encodings etc, some messages have invalid control characters.


doc.setField( "body", content.toString() );


In the solr logs, I get:

     [java] SEVERE: java.io.IOException: Illegal character ((CTRL-CHAR, code 22))
     [java]  at [row,col {unknown-source}]: [758496,50]
     [java]     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:73)
     [java]     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
     [java]     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)


Is there any standard way to escape invalid xml control characters?

If so, we should add that to XML.escapeCharData() -- this gets called from ClientUtils.writeXML()

It looks like the XML class already has something set for 22, so I'm not sure what could be happening.


I have also tried:

    // copy into a mutable buffer so characters can be replaced in place
    StringBuilder body = new StringBuilder( content.toString() );
    for( int i=0; i<body.length(); i++ ) {
      int c = body.charAt( i );
      // 9 = TAB, 10 = New Line, 13 = CR
      if( c < ' ' && c != 9 && c != 10 && c != 13 ) {
        log.warn( "Contains invalid character: '"+c+"' " );
        // replace control character with space
        body.setCharAt( i, ' ' );
      }
    }
    doc.setField( "body", body.toString() );

but that still gives the same error.
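
For reference, XML 1.0 only allows these characters: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The loop above only checks for values below the space character, so it would miss things like unpaired surrogates or #xFFFE. Below is a minimal sketch of a filter over the full range. The class and method names (XmlCharFilter, isValidXmlChar, stripInvalidXmlChars) are hypothetical, not part of Solr, and I have not confirmed that this fixes the error above.

```java
// Hypothetical helper, not part of Solr: filters a string down to the
// characters permitted by the XML 1.0 "Char" production.
public final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed in an XML 1.0 document. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point outside the XML 1.0 range with a space. */
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // Read a full code point so surrogate pairs are kept together;
            // an unpaired surrogate comes back as itself and is rejected.
            int cp = in.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

It would be called just before setting the field, e.g. doc.setField( "body", XmlCharFilter.stripInvalidXmlChars( content.toString() ) );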

Any ideas?
thanks
ryan
