No need to prove it. More modern PDF formats are easier to decode, but for many years the text stream was just move-print-move-print, so the font metrics were needed to guess where the spaces went. Plus, the glyph IDs had to be mapped back to characters, so some PDFs were effectively a substitution cipher. Our team joked about using cbw (the Crypt Breaker's Workbench) for PDF decoding, but decided it would be a problem for export.
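To make the spacing guess concrete, here is a toy sketch with made-up glyph positions and a fixed threshold; real extractors (PDFBox, and Tika on top of it) use the actual advance widths from the font program rather than anything this crude:

import java.util.Arrays;
import java.util.List;

public class SpaceGuess {

    // One decoded glyph run: where it starts, what it decodes to (via the
    // font's CMap), and its advance width from the font metrics.
    static class GlyphRun {
        final double x;
        final String text;
        final double width;
        GlyphRun(double x, String text, double width) {
            this.x = x; this.text = text; this.width = width;
        }
    }

    // If the gap between one run and the next is more than spaceFactor
    // times the previous run's width, guess that a space belongs there.
    static String join(List<GlyphRun> runs, double spaceFactor) {
        StringBuilder out = new StringBuilder();
        GlyphRun prev = null;
        for (GlyphRun r : runs) {
            if (prev != null && r.x - (prev.x + prev.width) > spaceFactor * prev.width) {
                out.append(' ');
            }
            out.append(r.text);
            prev = r;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "move-print-move-print": no explicit spaces, just positions.
        List<GlyphRun> runs = Arrays.asList(
            new GlyphRun(100.0, "Th", 12.0),
            new GlyphRun(112.5, "e", 6.0),     // tiny gap: same word
            new GlyphRun(125.0, "cat", 18.0)); // bigger gap: insert a space
        System.out.println(join(runs, 0.3));   // prints "The cat"
    }
}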
I saw one two-column PDF where the glyphs were laid out strictly top to bottom, across both columns. Whee!

A friend observed that turning a PDF into a structured document is like turning hamburger back into a cow. The PDF standard has improved a lot, but then you get an OCR'ed PDF.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 7, 2017, at 5:29 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> I'm going to guess it's the exact opposite. The metadata is the "semi-structured" part, which is much easier to collect than the PDF body text. I mean, there are parameters to tweak for how much space between letters (in the body text) is allowed while still counting it as a single word. I'm not quite sure how to prove that, but I'd be willing to make a bet ;)
>
> Erick
>
> On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>> I am indexing PDFs, and a separate process has converted any image PDFs to searchable PDFs before Solr gets near them. I notice that Tika is very slow at parsing some PDFs. I don't need any metadata (which I suspect is slowing Tika down), just the text. Has anyone used an alternative PDF text extraction library in a SolrJ context?
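Re: the question in the quoted thread about an alternative extraction library in a SolrJ context: Tika's PDF parser is a wrapper around PDFBox, so one option is to call PDFBox directly and skip the metadata pass. A bare-bones sketch, assuming PDFBox 2.x and an HttpSolrClient; the core URL and field names here are placeholders, not anything from this thread:

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PdfToSolr {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL; point this at your own collection.
        try (SolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {

            File pdf = new File(args[0]);

            // PDFBox only: extract the text, no metadata detection pass.
            String text;
            try (PDDocument doc = PDDocument.load(pdf)) {
                PDFTextStripper stripper = new PDFTextStripper();
                // Knobs like setSpacingTolerance() control the
                // "how wide a gap counts as a space" guess discussed above.
                text = stripper.getText(doc);
            }

            SolrInputDocument sdoc = new SolrInputDocument();
            sdoc.addField("id", pdf.getName());   // placeholder field names
            sdoc.addField("content_txt", text);
            solr.add(sdoc);
            solr.commit();
        }
    }
}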