RE: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
December 2017 3:36 p.m. To: solr-user@lucene.apache.org Subject: Re: Alternatives to tika for extracting text out of PDFs No need to prove it. More modern PDF formats are easier to decode, but for many years the text was move-print-move-print, so the font metrics were necessary to guess at spaces. Plus

Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Walter Underwood
No need to prove it. More modern PDF formats are easier to decode, but for many years the text was move-print-move-print, so the font metrics were necessary to guess at spaces. Plus, the glyph IDs had to be mapped to characters, so some PDFs were effectively a substitution code. Our team joked
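
To illustrate the "substitution code" point above, here is a minimal sketch of mapping glyph IDs back to characters. The function name `decode_glyphs`, the `to_unicode` table, and the sample IDs are all hypothetical, made up for illustration; a real PDF would supply this mapping through its font data, and this is not how Tika or any particular library does it.

```python
# Hypothetical sketch: glyph IDs are arbitrary numbers until a lookup
# table (assumed here to be a plain dict) maps them to characters.

def decode_glyphs(glyph_ids, to_unicode, missing="\ufffd"):
    """Translate glyph IDs to text; unknown IDs become U+FFFD."""
    return "".join(to_unicode.get(gid, missing) for gid in glyph_ids)


# Made-up table for one font: without it, [1, 2, 3, 4] is meaningless.
to_unicode = {1: "S", 2: "o", 3: "l", 4: "r"}

print(decode_glyphs([1, 2, 3, 4], to_unicode))  # prints "Solr"
print(decode_glyphs([1, 2, 9], to_unicode))     # unmapped ID 9 becomes U+FFFD
```

When the table is missing or incomplete, the extracted text is only as readable as a substitution cipher without its key.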

Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Erick Erickson
I'm going to guess it's the exact opposite. The metadata is the "semi-structured" part, which is much easier to collect than the PDF. I mean, there are parameters to tweak that control how much space between letters in words (in the body text) is allowed while still treating them as a single word.
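
A rough sketch of the kind of spacing parameter being discussed, assuming each glyph comes with an x position and an advance width from the font metrics. The `Glyph` class, `join_glyphs`, and the `gap_ratio` parameter are hypothetical names for illustration, not settings from Tika or any other extractor.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the line
    width: float  # advance width taken from the font metrics


def join_glyphs(glyphs, gap_ratio=0.3):
    """Rebuild one line of text from positioned glyphs.

    A space is inserted when the gap to the previous glyph exceeds
    gap_ratio times the previous glyph's width (an assumed heuristic).
    """
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Made-up positions: small kerning gaps inside words, one large gap between them.
glyphs = [
    Glyph("P", 0, 6), Glyph("D", 6.5, 6), Glyph("F", 13, 6),
    Glyph("t", 23, 3), Glyph("e", 26.2, 5), Glyph("x", 31.2, 5), Glyph("t", 36.4, 3),
]

print(join_glyphs(glyphs))  # prints "PDF text"
```

Lowering `gap_ratio` far enough makes ordinary letter spacing look like word breaks, and raising it too far merges adjacent words, which is why this sort of threshold needs tuning per document.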