Hi,

Solr 3.6 is just out with Tika 1.0. Can you try that? Also, Solr trunk now has 
Tika 1.1...
I recommend downloading Tika-App and testing your offending files directly with 
it: http://tika.apache.org/1.1/gettingstarted.html
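For example, something like this (the jar version and file name below are just 
placeholders, adjust them to your setup):

  java -jar tika-app-1.1.jar --text yourfile.docx

If that prints the inner XML file names instead of the document text, the 
problem is in Tika's parsing of that file; if it prints the expected text, the 
problem is on the Solr/DIH side.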

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 16. apr. 2012, at 14:14, Roman K wrote:

> Hello,
> I am running some tests to see whether we can use Solr in our organization.
> I have to be able to process MS Word .docx files and then be able to search 
> them as if they were plain text.
> 
> The problem is that when processing docx files, the result I get when 
> running the *:* query is:
> 
> <arr name="text"><str>_rels/.rels
> 
> word/fontTable.xml
> 
> word/_rels/document.xml.rels
> 
> word/document.xml
> 
> word/styles.xml
> 
> docProps/app.xml
> 
> docProps/core.xml
> 
> [Content_Types].xml
> 
> </str></arr>
> 
> which are the names of the xml files that are "zipped" inside the docx file.
> For regular doc/odt files, everything works great and I get the text from 
> inside the document.
> 
> I am using a slightly modified version of the example that comes with the 
> Solr 3.5 download.
> My tika-data-config file is:
> 
> <dataConfig>
>   <dataSource type="BinFileDataSource" name="bin"/>
>   <document>
>     <entity name="f" processor="FileListEntityProcessor" recursive="true"
>             rootEntity="false" dataSource="null" baseDir="/myDir/Documents"
>             fileName=".*\.(docx)|(DOCX)" onError="skip">
>       <entity name="tika-test" processor="TikaEntityProcessor"
>               url="${f.fileAbsolutePath}" dataSource="bin" format="text">
>         <field column="text" name="text"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
> 
> the "text" fieldType and field from schema.xml looks like:
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
> generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" 
> splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.PorterStemFilterFactory"/>-->
> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> </analyzer>
> 
> <fields>
> <field name="text" type="text" indexed="true" stored="true" 
> multiValued="true"/>
> </fields>
> 
> The Tika version used is 0.10 (the default that ships with Solr 3.5). 
> Downgrading to 0.9 didn't help.
> The same issue occurs with docx files saved from MS Word 2007/2010 and from 
> LibreOffice Writer, on both Windows and Ubuntu.
> Regular doc/odt files work perfectly.
> 
> 
> Thanks in advance for your help,
> Roman.
> 
