Hi,

Solr 3.6 is just out with Tika 1.0. Can you try that? Also, Solr trunk now has Tika 1.1. I recommend downloading Tika-App and testing your offending files directly with that: http://tika.apache.org/1.1/gettingstarted.html
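Independently of Tika, it may help to confirm that the file really carries its text where a parser would look for it. A .docx is a zip container, and the names in your output are its entries; the body text normally lives in word/document.xml. Below is a minimal sketch in Python using only the standard library. The helper names `list_entries` and `extract_text` and the sample path are made up for illustration, and this is only a rough sanity check, not how Tika itself parses the format.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def list_entries(path):
    """Return the names of the files zipped inside the .docx container."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()


def extract_text(path):
    """Return the body text, one line per paragraph (w:p), built from w:t runs."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t elements
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return "\n".join(paragraphs)


# Hypothetical usage (the path is an assumption):
#   print(list_entries("/myDir/Documents/sample.docx"))
#   print(extract_text("/myDir/Documents/sample.docx"))
```

If `extract_text` returns the document text while the index still contains only the entry names, the file itself is probably fine and the problem more likely sits in the extraction step.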
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 16. apr. 2012, at 14:14, Roman K wrote:

> Hello,
> I am running some tests to see, whether we can use Solr in our organization.
> I have to be able to process MS Word .docx files and then be able to search
> them as they were simple plain text.
>
> The problem is that when processing the docx files, the result that I get
> while running the *:* query is:
>
> <arr name="text"><str>_rels/.rels
>
> word/fontTable.xml
>
> word/_rels/document.xml.rels
>
> word/document.xml
>
> word/styles.xml
>
> docProps/app.xml
>
> docProps/core.xml
>
> [Content_Types].xml
>
> </str></arr>
>
> which are the names of the xml files that are "zipped" inside the docx file.
> For regular doc/odt files, everything works great and I get the text from
> inside the document.
>
> I am using the slightly modified example which comes with the Solr 3.5
> download.
> My tika-data-config file is:
>
> <dataConfig>
>   <dataSource type="BinFileDataSource" name="bin"/>
>   <document>
>     <entity name="f" processor="FileListEntityProcessor" recursive="true"
>             rootEntity="false"
>             dataSource="null" baseDir="/myDir/Documents"
>             fileName=".*\.(docx)|(DOCX)" onError="skip">
>       <entity name="tika-test" processor="TikaEntityProcessor"
>               url="${f.fileAbsolutePath}" dataSource="bin" format="text">
>         <field column="text" name="text"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> the "text" fieldType and field from schema.xml looks like:
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
>             splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PorterStemFilterFactory"/>-->
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>
> <fields>
>   <field name="text" type="text" indexed="true" stored="true"
>          multiValued="true"/>
> </fields>
>
> Tika version used is 0.10 (default that came with Solr 3.5). Downgrade to 0.9
> didn't help.
> The same issue is with docx files saved both from MS Word 2007/2010 and from
> LibreOffice Writer both on Windows and Ubuntu.
> Regular doc/odt files work perfect.
>
>
> Thanks in advance for your help,
> Roman.