: You can not do this in Solr, you cannot even send non-character code
: points in the first place. For Apache Nutch we solved the problem by

Strictly speaking: this is false.

You *can* send control characters to Solr as field values -- assuming your transport format allows it. Example: using javabin to send SolrInputDocuments from a SolrJ client doesn't care if the field value Strings have control characters in them. Likewise it should be possible to send many control characters when using JSON formatted updates -- let alone using something like DIH to pull blog data from a DB, or the Extracting Request Handler, which might find control characters in MS-Word or PDF docs.

In all of those cases, an UpdateProcessor to strip out the unwanted characters can/will work well.

In the specific case discussed in this thread (based on the eventual stack trace posted) an UpdateProcessor will *not* work, because the fundamental problem is that the control characters in question mean that the "XML-ish" looking bytes being sent to Solr by the client are not actually valid XML -- because by definition XML can not contain those invalid control characters. The request fails in the XML parser, before any UpdateProcessor gets a chance to run, so the characters have to be removed on the client before the XML is built.

-Hoss
http://www.lucidworks.com/
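
For the XML case, the stripping has to happen on the client side before the update is serialized. A minimal sketch of that, assuming the field values are available as Java Strings: the class and method names below (XmlCharFilter, stripInvalidXmlChars) are hypothetical helpers, not part of Solr or SolrJ, and the ranges checked are the ones allowed by the XML 1.0 "Char" production.

```java
// Hypothetical client-side helper; NOT a Solr or SolrJ class.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every code point XML 1.0 forbids removed. */
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt returns a lone surrogate as-is, which then fails the check
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Running each field value through something like stripInvalidXmlChars before building the XML update message should keep the XML parser from rejecting the request. Whether to drop the characters or replace them with a space is a judgment call that depends on the data.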