Re: Solr dih extract text from inline images in pdf

Charlie Hull Wed, 07 Mar 2018 01:44:10 -0800

On 07/03/2018 09:32, lala wrote:

Thanks for your reply Erick,


Actually I am using Solrj to index files, among other operations with Solr,
but to index a large number of different kinds of files, I'm sending a DIH
request to Solr using the Solrj API: FileListEntityProcessor with
TikaEntityParser...
Why not benefit from this technology if Solr offers it? It simplifies our
work tremendously...

It may simplify your work, but it isn't good practice. Tika has some heavy lifting to do to extract text from some formats and you should consider how this load will affect Solr. We've often put Tika into a different process for this reason.
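
As a rough illustration of keeping Tika out of the Solr process, here is a minimal sketch that assumes a standalone tika-server is listening on its default port 9998 and that Solr accepts JSON documents on its update handler. The core name "mycore" and the field name "content_txt" are hypothetical and need adapting to your own setup; this is not taken from the original thread.

```python
import json
import urllib.request

# Assumptions: tika-server on its default port, and a hypothetical Solr core "mycore".
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def build_tika_request(data, tika_url=TIKA_URL):
    """Build a PUT request that asks tika-server for plain text."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request carrying one JSON document for Solr's update handler."""
    # "content_txt" is a hypothetical field name; use one defined in your schema.
    payload = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url, data=payload, method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path):
    """Extract text in the Tika process, then send only the text to Solr."""
    with open(path, "rb") as fh:
        data = fh.read()
    with urllib.request.urlopen(build_tika_request(data)) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(build_solr_request(path, text)) as resp:
        return resp.status
```

With this split, a slow or failing parse ties up the Tika process and the indexing client, while Solr only ever receives already-extracted text.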

Isn't there any way to be able to extract inline images in PDF docs??

https://stackoverflow.com/questions/31303735/how-to-extract-images-from-a-file-using-apache-tika has some useful suggestions.
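
One possible approach, sketched here under assumptions rather than taken from the linked answers: Tika's PDF parser has an extractInlineImages option, and a standalone tika-server is assumed to accept it through an "X-Tika-PDFextractInlineImages" request header. Please verify that header name against your Tika version. Getting text out of the images also assumes Tesseract OCR is installed where Tika runs. The host, port and function names below are illustrative.

```python
import http.client


def inline_image_headers():
    """Headers asking tika-server for plain text with inline PDF images processed."""
    return {
        "Accept": "text/plain",
        "Content-Type": "application/pdf",
        # Assumed header name; check it against your tika-server version.
        "X-Tika-PDFextractInlineImages": "true",
    }


def extract_pdf_text(path, host="localhost", port=9998):
    """PUT a PDF to tika-server and return whatever text it extracts."""
    with open(path, "rb") as fh:
        body = fh.read()
    # http.client sends header names exactly as written, without changing their case.
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request("PUT", "/tika", body=body, headers=inline_image_headers())
        resp = conn.getresponse()
        return resp.read().decode("utf-8", errors="replace")
    finally:
        conn.close()
```

OCR on embedded images can be slow, which is one more reason to run this outside Solr.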


Charlie


Waiting for your reply, best regards...



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
