On 07/03/2018 13:29, lala wrote:
Thanks Charlie...
It's just confusing for me. In the DIH configuration file, the inner entity
that uses "TikaEntityProcessor" as its processor lets me set a tikaConfig
attribute pointing to an XML file inside the core's config folder, and in
that file I should be able to override the PDFParser default properties...
As in parseContext.Config...
The thing is that I placed my tika-config.xml file in the config folder and
set the "tikaConfig" attribute to "tika-config.xml"... But Tika is still not
parsing images inside the PDF file!
Let's say this is just an experiment with Solr DIH crawling... Why isn't it
working?
This is my tika-config.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        true
        true
      </params>
    </parser>
  </parsers>
</properties>
I've read the code in both TikaEntityProcessor and TikaConfig... It should
read the XML file from the config folder, extract the params and override
the original PDFParser attributes. But it DOESN'T!
Any ideas?
Hi,
My reading of
https://tika.apache.org/1.17/configuring.html#Using_a_Tika_Configuration_XML_file
is that your PDFParser settings may not take effect unless you explicitly
exclude the PDFParser from the DefaultParser, which I don't think you're
doing above.
I'm not an expert on Tika configuration, but I think you should first
try this xml file with standalone Tika and see if it does what you think
it should. Once you're sure, then try it with DIH or SolrJ.
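For reference, here is a rough sketch of what that exclusion might look like,
going by the page linked above. It is untested, and the extractInlineImages
param is my assumption about what you are trying to set, since it relates to
images inside PDFs; substitute whatever params your file actually uses.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Let DefaultParser handle everything except PDFs -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- PDFs go to this explicitly configured PDFParser instead -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Assumed param, not taken from your file -->
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The idea is that without the exclusion, the DefaultParser may still claim PDFs
with its own default PDFParser, so your configured one never gets used.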
Cheers
Charlie
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk