Re: Alternatives to tika for extracting text out of PDFs

Erick Erickson Thu, 07 Dec 2017 17:43:11 -0800

I'm going to guess it's the exact opposite. The metadata is the
"semi-structured" part, which is much easier to collect than the text
of the PDF itself. For the body text, for example, there are parameters
to tweak that control how much space is allowed between letters before
they stop being treated as a single word. I'm not quite sure how to
prove that, but I'd be willing to make a bet ;)
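
To illustrate the kind of spacing parameter mentioned above, here is a
minimal sketch in Java. It is not Tika's or PDFBox's actual algorithm;
the class WordSpacingSketch, the Glyph type, and the tolerance parameter
are all hypothetical names invented for this example. The assumption is
that each extracted character arrives with a horizontal position and a
width, and that a gap wider than some fraction of the average character
width marks a word break. (Tika's PDF parsing is built on Apache PDFBox,
whose PDFTextStripper has setters such as setSpacingTolerance and
setAverageCharTolerance that play a comparable role.)

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the algorithm used by Tika or PDFBox.
final class WordSpacingSketch {

    /** A positioned character: the char, its left x coordinate, and its width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Groups glyphs (assumed sorted left to right on one line) into words.
     * A new word starts when the gap between two neighbouring glyphs is
     * larger than tolerance * averageGlyphWidth.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, double tolerance) {
        List<String> words = new ArrayList<>();
        if (glyphs.isEmpty()) {
            return words;
        }

        // Average width of all glyphs on the line, used to scale the threshold.
        double total = 0.0;
        for (Glyph g : glyphs) {
            total += g.width;
        }
        double threshold = tolerance * (total / glyphs.size());

        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                // Horizontal gap between the end of the previous glyph and this one.
                double gap = g.x - (prev.x + prev.width);
                if (gap > threshold) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.ch);
            prev = g;
        }
        words.add(current.toString());
        return words;
    }
}
```

With three glyphs of width 10 placed at x = 0, 11 and 30, a tolerance of
0.5 yields the words "ab" and "c", while a tolerance of 1.0 merges them
into "abc". Every such comparison has to be made for each character on
each page, which is why I'd expect the body text, not the metadata, to
be where the time goes.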


Erick

On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> I am indexing PDFs, and a separate process has converted any image PDFs to
> searchable PDFs before Solr gets near them. I notice that Tika is very slow at
> parsing some PDFs. I don't need any metadata (which I suspect is slowing Tika
> down), just the text. Has anyone used an alternative PDF text extraction
> library in a SolrJ context?
