Solr defers to Tika for this. Tika uses getParagraphText() from the POI
WordExtractor class:
http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html
POI appears to be in limbo and I'm not seeing anything in WordExtractor
that looks like it might help you.
I'd inquire at
I am using the Solr nightly build 8/11/09. I have set the text field in the
solrconfig.xml file to be stored. I index an MS Word document, and when I
search for a word in the text of the document, the result is returned in XML format.
The text field is showing the text of the document but there are a