Well, I have a lot of OCRed PDFs, but the extremely slow text extraction is hard to pin down. The bulk of the OCRed ones aren't too slow, but then I have one that takes several minutes. I use a little utility, pdftotext.exe, to make a crude guess at whether OCR is necessary, and it is much faster (but not that easy to use in the indexing workflow). Some of the big modern ones (fully digital) can also be very slow. Maybe the amount of inline imagery? It doesn't seem to bother pdftotext.
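For what it's worth, this is roughly the shape of wiring pdftotext into a Java workflow with a timeout, so one pathological PDF can't stall the whole indexing run. This is just a sketch: the binary name on the PATH, the timeout, and the temp-file handling are assumptions, not the exact setup in use here.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

public class PdfTextPreCheck {

    // Run "pdftotext <in.pdf> <out.txt>" and return the text, or null if the
    // conversion fails or exceeds the timeout (a sign the PDF is pathological).
    static String quickExtract(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("quickExtract", ".txt");
        try {
            Process p = new ProcessBuilder("pdftotext", pdf.toString(), out.toString())
                    .redirectErrorStream(true)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return null; // too slow -- flag the file for separate handling
            }
            if (p.exitValue() != 0) {
                return null; // pdftotext could not parse it
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}

If this returns almost no text for a many-page PDF, that is a reasonable crude signal the file is image-only and needs OCR.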
-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Friday, 8 December 2017 3:36 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Alternatives to tika for extracting text out of PDFs

No need to prove it. More modern PDF formats are easier to decode, but for many years the text was move-print-move-print, so the font metrics were necessary to guess at spaces. Plus, the glyph IDs had to be mapped to characters, so some PDFs were effectively a substitution code. Our team joked about using cbw (Crypt Breakers Workbench) for PDF decoding, but decided it would be a problem for export.

I saw one two-column PDF where the glyphs were laid out strictly top to bottom, across both columns. Whee!

A friend observed that turning a PDF into a structured document is like turning hamburger back into a cow.

The PDF standard has improved a lot, but then you get an OCR'ed PDF.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 7, 2017, at 5:29 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> I'm going to guess it's the exact opposite. The metadata is the "semi-structured" part, which is much easier to collect than the body text. I mean, there are parameters to tweak for how much space between letters (in the body text) should be allowed while still counting them as a single word. I'm not quite sure how to prove that, but I'd be willing to make a bet ;)
>
> Erick
>
> On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>> I am indexing PDFs, and a separate process converts any image PDFs to searchable PDFs before Solr gets near them. I notice that Tika is very slow at parsing some PDFs. I don't need any metadata (which I suspect is slowing Tika down), just the text. Has anyone used an alternative PDF text extraction library in a SolrJ context?
>>
>> Notice: This email and any attachments are confidential and may not be used, published or redistributed without the prior written consent of the Institute of Geological and Nuclear Sciences Limited (GNS Science). If received in error please destroy and immediately notify GNS Science. Do not copy or disclose the contents.
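On the original question of alternatives to Tika in a SolrJ context: Tika's PDF parser delegates to Apache PDFBox, so one option is to call PDFBox directly and skip the metadata handling entirely. A minimal sketch, assuming PDFBox 2.x on the classpath (the setSortByPosition call is an optional tweak that can help with OCRed or oddly-ordered glyph streams, not something proven here):

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PlainPdfText {

    // Extract only the text layer with PDFBox, bypassing Tika's metadata handling.
    static String extract(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true); // reorder glyphs by position on the page
            return stripper.getText(doc);
        }
    }
}

The resulting string can go straight into a SolrInputDocument and be sent with SolrJ as usual; whether this is actually faster than Tika on the slow PDFs would need measuring.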