Hello! I am indexing into Solr 4.9.0 using the /update request handler and am getting errors from Tika: "Illegal IOException from org.apache.tika.parser.xml.DcXMLParser@74ce3bea", caused by "MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence". I believe this happens because I am passing data to Solr via curl as XML, and the data contains non-UTF-8 characters such as smart quotes (the irony of that name is amazing). This is the request I send:

curl "http://10.0.0.10/solr/pp/update?commit=true" -H "Content-Type: text/xml" --data-binary "<add><doc><field name=\"id\">123456</field><field name=\"observation\">This is some text that was passed from the .NET application to Solr for indexing. Users typically write in Word then copy and paste into the .NET application UI which then passes everything to Solr for indexing. If there are \"smart quotes\" it crashes, but \"regular quotes\" are fine.</field></doc></add>"

I also tried /update/extract, but since this isn't an actual document, that doesn't work either.

Is there a way to handle these non-UTF-8 characters with the /update handler I'm currently using, for example by changing the content type or the request handler? Or does text/xml simply not allow these characters, so that I need to add logic to the application to strip them out?

Any thoughts or advice would be appreciated!

Thanks!
-Teague
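
P.S. To illustrate what I suspect is going on, here is a minimal Python sketch (not part of my application). It assumes the pasted text reaches Solr as Windows-1252 bytes, which I have not confirmed. The helper name is_valid_utf8 is made up for this example. Under that assumption, a left smart quote is the single byte 0x93, which is not a valid start of a UTF-8 sequence, and that would match the "Invalid byte 1 of 1-byte UTF-8 sequence" message.

```python
# Assumption: the client sends the text as Windows-1252 (cp1252) bytes.
text = "If there are \u201csmart quotes\u201d it crashes"

cp1252_bytes = text.encode("cp1252")  # left smart quote becomes the single byte 0x93
utf8_bytes = text.encode("utf-8")     # left smart quote becomes the bytes E2 80 9C


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(cp1252_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```

If this is the cause, the same text encoded as UTF-8 before it is sent should pass the check, so the fix might be about encoding rather than stripping characters.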