No need to prove it. More modern PDF formats are easier to decode, but for many years the text stream was just move-print-move-print, so the font metrics were needed to guess where the spaces went. Plus, the glyph IDs had to be mapped back to characters, so some PDFs were effectively a substitution cipher. Our team joked about using cbw (the Crypt Breaker's Workbench) for PDF decoding, but decided it would be a problem for export.
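To make the spacing guess concrete, here is a toy sketch with made-up glyph positions and a fixed threshold; real extractors (PDFBox, and Tika on top of it) use the actual advance widths from the font program rather than anything this crude:

import java.util.Arrays;
import java.util.List;

public class SpaceGuess {

    // One decoded glyph run: where it starts, what it decodes to (via the
    // font's CMap), and its advance width from the font metrics.
    static class GlyphRun {
        final double x;
        final String text;
        final double width;
        GlyphRun(double x, String text, double width) {
            this.x = x; this.text = text; this.width = width;
        }
    }

    // If the gap between one run and the next is more than spaceFactor
    // times the previous run's width, guess that a space belongs there.
    static String join(List<GlyphRun> runs, double spaceFactor) {
        StringBuilder out = new StringBuilder();
        GlyphRun prev = null;
        for (GlyphRun r : runs) {
            if (prev != null && r.x - (prev.x + prev.width) > spaceFactor * prev.width) {
                out.append(' ');
            }
            out.append(r.text);
            prev = r;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "move-print-move-print": no explicit spaces, just positions.
        List<GlyphRun> runs = Arrays.asList(
            new GlyphRun(100.0, "Th", 12.0),
            new GlyphRun(112.5, "e", 6.0),     // tiny gap: same word
            new GlyphRun(125.0, "cat", 18.0)); // bigger gap: insert a space
        System.out.println(join(runs, 0.3));   // prints "The cat"
    }
}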
I saw one two-column PDF where the glyphs were laid out strictly top to bottom, across both columns. Whee!

A friend observed that turning a PDF into a structured document is like turning hamburger back into a cow. The PDF standard has improved a lot, but then you get an OCR'ed PDF.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 7, 2017, at 5:29 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> I'm going to guess it's the exact opposite. The metadata is the "semi-structured" part, which is much easier to collect than the PDF body text. I mean, there are parameters to tweak for how much space between letters (in the body text) is allowed while still counting it as a single word. I'm not quite sure how to prove that, but I'd be willing to make a bet ;)
>
> Erick
>
> On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>> I am indexing PDFs, and a separate process has converted any image PDFs to searchable PDFs before Solr gets near them. I notice that Tika is very slow at parsing some PDFs. I don't need any metadata (which I suspect is slowing Tika down), just the text. Has anyone used an alternative PDF text extraction library in a SolrJ context?
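Re: the question in the quoted thread about an alternative extraction library in a SolrJ context: Tika's PDF parser is a wrapper around PDFBox, so one option is to call PDFBox directly and skip the metadata pass. A bare-bones sketch, assuming PDFBox 2.x and an HttpSolrClient; the core URL and field names here are placeholders, not anything from this thread:

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PdfToSolr {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL; point this at your own collection.
        try (SolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {

            File pdf = new File(args[0]);

            // PDFBox only: extract the text, no metadata detection pass.
            String text;
            try (PDDocument doc = PDDocument.load(pdf)) {
                PDFTextStripper stripper = new PDFTextStripper();
                // Knobs like setSpacingTolerance() control the
                // "how wide a gap counts as a space" guess discussed above.
                text = stripper.getText(doc);
            }

            SolrInputDocument sdoc = new SolrInputDocument();
            sdoc.addField("id", pdf.getName());   // placeholder field names
            sdoc.addField("content_txt", text);
            solr.add(sdoc);
            solr.commit();
        }
    }
}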