: I am indexing Solr 4.9.0 using the /update request handler and am getting
: errors from Tika - Illegal IOException from
: org.apache.tika.parser.xml.DcXMLParser@74ce3bea which is caused by
: MalFormedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. I

FWIW: that error appears to have come from /update/extract .. hard to be
sure w/o full stack trace from the logs ... but i'll assume that's just a
copy/paste mistake from the second test you mentioned trying, and assume
your assessment is correct...

: believe that this is the result of attempting to pass information to Solr
: via CURL as XML in which the data has non UTF characters such as Smart
: Quotes (the irony of that name is amazing). So when I:

...and focus on the example command you mentioned...

: curl http://10.0.0.10/solr/pp/update?commit=true -H "Content-Type: text/xml"
: --data-binary "<add><doc><field name=\"id\">123456</field><field
: name=\"observation\">This is some text that was passed from the .NET
: application to Solr for indexing. Users typically write in Word then copy
: and paste into the .NET application UI which then passes everything to Solr
: for indexing. If there are "smart quotes" it crashes, but "regular quotes"
: are fine.</field></doc></add>"

if you tell solr you are sending it XML, then you have to send it valid
XML.

if you don't specify a charset -- either in the Content-Type, or in an XML
prolog declaration -- then the XML spec says UTF-8 must be assumed. if the
bytes in your doc aren't UTF-8, it's not a valid XML file, etc....

if you actually know what charset you are sending, then you can specify it
-- and as long as your JVM implementation understands it, it should work.

you can't however just read some raw bytes from somewhere, slap some
xml-ish lookin strings in front & behind, and hope you have valid xml.
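To make the "raw bytes plus xml-ish strings" failure concrete, here is a minimal sketch in Python (not .NET, purely for illustration). It assumes the smart quotes arrive as windows-1252 bytes, which is only a guess about what the .NET side produces, and it uses Python's standard-library XML parser as a stand-in for the parser Solr uses. The names `wrap` and `is_well_formed` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Text containing "smart quotes" (U+201C / U+201D), as pasted from Word.
text = "If there are \u201csmart quotes\u201d it crashes"


def wrap(body: bytes) -> bytes:
    """Naively put xml-ish strings around raw bytes, with no charset declared."""
    return (b'<add><doc><field name="observation">'
            + body
            + b"</field></doc></add>")


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if the bytes parse as XML, False if the parser rejects them."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# ASSUMPTION: the client hands over windows-1252 bytes, where the smart
# quotes are the single bytes 0x93 and 0x94 (not valid as UTF-8).
cp1252_doc = wrap(text.encode("cp1252"))

# The same text encoded as UTF-8, which is what an undeclared XML doc must be.
utf8_doc = wrap(text.encode("utf-8"))

# Transcoding works only if you actually know the source charset.
transcoded_doc = cp1252_doc.decode("cp1252").encode("utf-8")

print(is_well_formed(cp1252_doc))      # False
print(is_well_formed(utf8_doc))        # True
print(is_well_formed(transcoded_doc))  # True
```

The other route mentioned above is to declare the charset instead of transcoding, e.g. a header along the lines of "Content-Type: text/xml; charset=windows-1252". That only helps if the bytes really are in the charset you declare.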
if you use a good XML serialization library in your .Net application to
generate the messages you send to Solr, then the serialization library
should help mitigate this problem -- either by specifying the correct
encoding in the xml prolog it generates for you in its output, or by
converting the input "strings" to utf-8, or by giving you a good error
if/when you ask it to serialize characters that can't be serialized in XML
(there are some, like null bytes and control characters).

-Hoss
http://www.lucidworks.com/
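
As a rough illustration of the serialization-library point, here is a sketch in Python (again standing in for .NET) that builds the same kind of <add> message with a real XML serializer instead of string concatenation. The helper name `build_add_message` is hypothetical. Python's ElementTree takes care of escaping and of writing a UTF-8 prolog, but it does not reject illegal characters on its own, so the sketch adds an explicit check against the XML 1.0 character range.

```python
import io
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production (null bytes, most control
# characters, surrogates, U+FFFE/U+FFFF) cannot appear in an XML document.
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def build_add_message(doc_id: str, observation: str) -> bytes:
    """Hypothetical helper: serialize one Solr <add> message as UTF-8 XML."""
    fields = {"id": doc_id, "observation": observation}

    # Fail early with a readable error instead of emitting invalid XML.
    for name, value in fields.items():
        bad = _ILLEGAL_XML_CHARS.search(value)
        if bad:
            raise ValueError(
                f"field {name!r} contains a character not allowed in XML: "
                f"{bad.group()!r}"
            )

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # the serializer escapes &, < and > for us

    # Write bytes with an explicit prolog declaring the encoding.
    buf = io.BytesIO()
    ET.ElementTree(add).write(buf, encoding="utf-8", xml_declaration=True)
    return buf.getvalue()


message = build_add_message(
    "123456",
    "If there are \u201csmart quotes\u201d it crashes, "
    'but "regular quotes" are fine.',
)
```

The resulting `message` bytes are what you would send as the request body (for example by writing them to a file and pointing curl's --data-binary at it). Smart quotes go through as ordinary UTF-8, while something like a null byte in the input raises a ValueError before anything reaches Solr.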