Re: Solr dih extract text from inline images in pdf

Charlie Hull Wed, 07 Mar 2018 07:45:07 -0800

On 07/03/2018 13:29, lala wrote:

Thanks Charlie...
It's just confusing for me. In the DIH configuration file, the inner entity
that uses "TikaEntityProcessor" as its processor lets me set a
tikaConfig attribute pointing to an XML file located in the core's config
folder, and in that file I should be able to override the PDFParser
default properties, as in parseContext.Config...
The thing is that I placed my tika-config.xml file in the config folder and
set the "tikaConfig" attribute to "tika-config.xml", but Tika is still not
parsing images inside the PDF file!
Let's say this is just experimenting with Solr DIH crawling... Why is it not
working?


This is my tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
     <parsers>
         <parser class="org.apache.tika.parser.DefaultParser"/>
         <parser class="org.apache.tika.parser.pdf.PDFParser">
             <params>
                 true
                 true
             </params>
         </parser>
     </parsers>
</properties>

I've read the code in both TikaEntityProcessor and TikaConfig... It should
read the XML file from the config folder, extract the params and override the
original PDFParser attributes. But it DOESN'T!
Any idea?

Hi,

My reading of https://tika.apache.org/1.17/configuring.html#Using_a_Tika_Configuration_XML_file indicates that your custom PDF parser entry may not run unless you explicitly exclude PDFs from the DefaultParser, which I don't think you're doing above.

I'm not an expert on Tika configuration, but I think you should first try this xml file with standalone Tika and see if it does what you think it should. Once you're sure, then try it with DIH or SolrJ.
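
For what it's worth, here is a rough sketch of what I mean by excluding PDFs, adapted from the pattern shown on that Tika page. It is untested against your setup, and the "extractInlineImages" parameter name is my assumption based on the subject of this thread (the parameter names in your posted file did not come through), so please check it against the PDFParser documentation for your Tika version:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- DefaultParser handles everything except PDFs -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime-exclude>application/pdf</mime-exclude>
        </parser>
        <!-- PDFs go to an explicitly configured PDFParser instead -->
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <mime>application/pdf</mime>
            <params>
                <!-- assumed parameter name; verify for your Tika version -->
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
```

As far as I know, getting text out of the extracted images also needs Tesseract installed so that Tika's OCR parser can run, so that is worth checking in the standalone test too.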


Cheers

Charlie







--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
