: I am using Apache POI parser to parse a Word Doc and extract the text
: content. Then i am passing the text content to SOLR. The Word document has
: many pictures, graphs and tables. But when i am passing the content to SOLR,
: it fails. Here is the exception trace.
:
: 09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM
: org.apache.solr.common.SolrException log
: SEVERE: [com.ctc.wstx.exc.WstxLazyException]
: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x7) not a valid XML character
:  at [row,col {unknown-source}]: [40,18]

The error string is fairly self-explanatory: at row 40, column 18 of the XML being parsed there is a character (0x7) that isn't legal in XML. Not every Unicode character is legal in XML, even when it is correctly encoded as UTF-8.

If you search the Solr archives for "Illegal character" you'll find lots of discussion about this and how to deal with it in general.

You might also want to check out this comment pointing out some advantages of using Tika instead of using POI directly...

https://issues.apache.org/jira/browse/LUCENE-1559?#action_12681347

...lastly, you might want to check out this plugin and do all the hard work server side...

http://wiki.apache.org/solr/ExtractingRequestHandler

-Hoss
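
For reference, one common client-side approach is to drop any character outside the XML 1.0 "Char" range from the extracted text before building the Solr update message. The following is only a minimal sketch of that idea: the class name XmlTextSanitizer and its methods are made up for illustration and are not part of POI, Solr, or Tika, and it assumes that silently discarding the offending characters (rather than replacing them with, say, a space) is acceptable for your index.

```java
public final class XmlTextSanitizer {

    private XmlTextSanitizer() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Hypothetical helper: returns a copy of the text with every code point
     * that XML 1.0 does not allow (such as 0x7) removed.
     */
    public static String stripInvalidXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative input only: text containing the 0x7 control character.
        String extracted = "Quarterly results\u0007Revenue\u0007";
        System.out.println(stripInvalidXmlChars(extracted));
    }
}
```

You would call stripInvalidXmlChars on the string returned by the extractor, just before adding it to the document you send to Solr. Tabs, line feeds and carriage returns are kept, since XML allows them.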