I'm going to guess it's the exact opposite. The metadata is the "semi-structured" part, which is much easier to extract than the body text of the PDF. For instance, there are parameters to tweak that control how much space is allowed between letters in the body text before they stop being treated as a single word. I'm not quite sure how to prove that, but I'd be willing to make a bet ;)
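To make the spacing point concrete, here is a toy sketch in Java of the kind of decision a text extractor has to make for every character in the body. It is not Tika's or PDFBox's actual code: the class `WordGrouper`, the method `groupIntoWords`, the `maxGap` parameter, and the sample coordinates are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; not the algorithm of any real PDF library.
final class WordGrouper {

    // One positioned character, roughly as an extractor might see it on a page.
    static final class Glyph {
        final char ch;
        final double x;      // left edge of the glyph
        final double width;  // horizontal advance of the glyph

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs (already sorted left to right) into words.
    // A gap wider than maxGap (hypothetical tuning parameter) starts a new word.
    static List<String> groupIntoWords(List<Glyph> glyphs, double maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousRightEdge = 0.0;
        for (Glyph g : glyphs) {
            boolean gapTooWide =
                current.length() > 0 && (g.x - previousRightEdge) > maxGap;
            if (gapTooWide) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.ch);
            previousRightEdge = g.x + g.width;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    // Made-up line: "PDF", a wide gap, then "text" with loose spacing after the first 't'.
    static List<Glyph> sampleLine() {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph('P', 0.0, 5.0));
        line.add(new Glyph('D', 5.2, 5.0));
        line.add(new Glyph('F', 10.4, 5.0));
        line.add(new Glyph('t', 20.0, 5.0));   // gap of about 4.6 before this glyph
        line.add(new Glyph('e', 26.5, 5.0));   // gap of 1.5 before this glyph
        line.add(new Glyph('x', 31.7, 5.0));
        line.add(new Glyph('t', 36.9, 5.0));
        return line;
    }

    public static void main(String[] args) {
        List<Glyph> line = sampleLine();
        System.out.println(groupIntoWords(line, 0.5)); // [PDF, t, ext]
        System.out.println(groupIntoWords(line, 2.0)); // [PDF, text]
        System.out.println(groupIntoWords(line, 5.0)); // [PDFtext]
    }
}
```

The same glyphs come out as three different word lists depending on one tolerance value, and nothing comparable has to be worked out for metadata fields. That is the reasoning behind the bet, not a measurement.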
Erick

On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> I am indexing PDFs, and a separate process has converted any image PDFs to
> searchable PDFs before Solr gets near them. I notice that Tika is very slow at
> parsing some PDFs. I don't need any metadata (which I suspect is slowing Tika
> down), just the text. Has anyone used an alternative PDF text extraction
> library in a SolrJ context?