Issue with Solr 3.5 while using TikaEntityProcessor on .docx files

Roman K Mon, 16 Apr 2012 05:15:13 -0700

Hello,
I am running some tests to see whether we can use Solr in our organization.

I need to process MS Word .docx files and then search them as if they were plain text.

The problem is that after processing the docx files, the result I get from the *:* query is:


<arr name="text"><str>_rels/.rels

word/fontTable.xml

word/_rels/document.xml.rels

word/document.xml

word/styles.xml

docProps/app.xml

docProps/core.xml

[Content_Types].xml

</str></arr>

which are the names of the xml files that are "zipped" inside the docx file.
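
To illustrate the "zipped" part: a .docx file is a ZIP archive, so listing the archive entries produces the same kind of names as the query output above. The following is only a minimal Python sketch, not part of the Solr setup. The helper names `build_fake_docx` and `list_entries` and the placeholder contents are made up for illustration; only the entry names come from the output above.

```python
import io
import zipfile

# Entry names copied from the query output above; the contents are placeholders.
ENTRY_NAMES = [
    "_rels/.rels",
    "word/fontTable.xml",
    "word/_rels/document.xml.rels",
    "word/document.xml",
    "word/styles.xml",
    "docProps/app.xml",
    "docProps/core.xml",
    "[Content_Types].xml",
]


def build_fake_docx():
    """Build an in-memory ZIP archive that mimics the layout of a .docx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ENTRY_NAMES:
            archive.writestr(name, "<placeholder/>")
    return buffer.getvalue()


def list_entries(data):
    """Return the names of the entries stored in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


if __name__ == "__main__":
    # Prints one entry name per line, in the order they were written.
    for name in list_entries(build_fake_docx()):
        print(name)
```

Pointing `zipfile.ZipFile` at a real .docx file on disk should list its entries in the same way, which suggests the file is being handled as a generic ZIP archive rather than as a Word document.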

For regular doc/odt files, everything works great and I get the text from inside the document.

I am using a slightly modified version of the example that comes with the Solr 3.5 download.

My tika-data-config file is:

<dataConfig>
<dataSource type="BinFileDataSource" name="bin"/>
<document>
<entity name="f" processor="FileListEntityProcessor" recursive="true"
                rootEntity="false"
                dataSource="null" baseDir="/myDir/Documents"
                fileName=".*\.(docx|DOCX)" onError="skip">

<entity name="tika-test" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" dataSource="bin" format="text">

<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>

The "text" fieldType and field from schema.xml look like this:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>

<fields>

<field name="text" type="text" indexed="true" stored="true" multiValued="true"/>

</fields>

The Tika version used is 0.10 (the default that came with Solr 3.5). Downgrading to 0.9 didn't help. The same issue occurs with docx files saved from MS Word 2007/2010 and from LibreOffice Writer, on both Windows and Ubuntu.

Regular doc/odt files work perfectly.


Thanks in advance for your help,
Roman.
