On 07/03/2018 13:29, lala wrote:
Thanks Charlie...
It's just confusing for me. In the DIH configuration file, on the inner
entity that uses "TikaEntityProcessor" as its processor, I can easily point
the tikaConfig attribute at an XML file located inside the core's config
folder, and in that file I should be able to override the PDFParser default
properties... as in parseContext.Config...
The thing is that I placed my tika-config.xml file in the config folder and
set the "tikaConfig" attribute to "tika-config.xml"... But Tika is still not
parsing images inside the PDF file!!!
Let's say this is just experimenting with Solr DIH crawling... Why isn't it
working?

This is my tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
     <parsers>
         <parser class="org.apache.tika.parser.DefaultParser"/>
         <parser class="org.apache.tika.parser.pdf.PDFParser">
             <params>
                 true
                 true
             </params>
         </parser>
     </parsers>
</properties>

I've read the code in both TikaEntityProcessor and TikaConfig... It should
read the XML file from the config folder, extract the params, and override
the original PDFParser attributes. But it DOESN'T!
Any ideas?

Hi,

My reading of https://tika.apache.org/1.17/configuring.html#Using_a_Tika_Configuration_XML_file is that your separately configured PDFParser may not be the one that runs unless you explicitly exclude PDFParser from the DefaultParser, which I don't think you're doing above.
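
For illustration, here is a minimal sketch of what that page seems to describe: the DefaultParser entry excludes PDFParser, so that the separately configured PDFParser entry is the one that handles PDFs. The extractInlineImages parameter is my assumption about what the first "true" in your file was meant to set; I have left out the second one because I can't tell which property it was for. I have not tested this with DIH.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Let DefaultParser handle everything except PDFs -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <!-- PDFs go to this explicitly configured PDFParser instead -->
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <!-- Assumed parameter: the original name is not visible above -->
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
```

If a file like this behaves as expected with standalone Tika but not through DIH, that would suggest the problem lies in how the config is being loaded rather than in the file itself.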

I'm not an expert on Tika configuration, but I think you should first try this xml file with standalone Tika and see if it does what you think it should. Once you're sure, then try it with DIH or SolrJ.

Cheers

Charlie



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
