December 2017 3:36 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Alternatives to tika for extracting text out of PDFs
No need to prove it. More modern PDF formats are easier to decode, but for many
years the text was move-print-move-print, so the font metrics were necessary to
guess at spaces. Plus, the glyph IDs had to be mapped to characters, so some
PDFs were effectively a substitution code. Our team joked
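To illustrate the "substitution code" point above, here is a minimal sketch in Python. The table GLYPH_TO_CHAR and the function decode_glyphs are made up for this example and are not part of Tika or any PDF library. A real PDF may carry such a table in the font (for instance a ToUnicode CMap) or may omit it, in which case the glyph IDs alone say nothing about the characters.

```python
# Hypothetical glyph-ID-to-character table, invented for illustration.
# Without a table like this, the glyph IDs read like a substitution cipher.
GLYPH_TO_CHAR = {1: "S", 2: "o", 3: "l", 4: "r"}


def decode_glyphs(glyph_ids, table, unknown="\ufffd"):
    """Map glyph IDs to characters; unmapped IDs become a replacement character."""
    return "".join(table.get(glyph_id, unknown) for glyph_id in glyph_ids)


if __name__ == "__main__":
    print(decode_glyphs([1, 2, 3, 4], GLYPH_TO_CHAR))
    # An ID missing from the table cannot be recovered from the ID alone.
    print(decode_glyphs([1, 9], GLYPH_TO_CHAR))
```

When the table is missing or wrong, an extractor has to fall back on guesswork or OCR, which is why such PDFs behave like coded text.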
I'm going to guess it's the exact opposite. The metadata is the
"semi-structured" part, which is much easier to collect than the PDF.
I mean, there are parameters to tweak that control how much space is
allowed between letters (in the body text) while still treating them
as a single word.
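
A rough sketch of what such a spacing parameter does, in Python. Everything here is an assumption made for illustration: the Glyph class, the join_glyphs function, and the max_gap_ratio parameter are hypothetical names, not the settings of Tika or any specific extractor. The idea is only that each glyph arrives with a position and a width from the font metrics, and a gap wider than some fraction of the font size is treated as a word break.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the line
    width: float  # advance width, taken from the font metrics


def join_glyphs(glyphs, font_size, max_gap_ratio=0.25):
    """Rebuild a line of text, inserting a space where the gap is too wide.

    max_gap_ratio is a hypothetical tuning knob: the largest gap, as a
    fraction of the font size, that still counts as part of one word.
    """
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    out = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        # Distance from the end of the previous glyph to the start of this one.
        gap = cur.x - (prev.x + prev.width)
        if gap > max_gap_ratio * font_size:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


if __name__ == "__main__":
    line = [
        Glyph("t", 0.0, 5.0),
        Glyph("o", 5.5, 5.0),
        Glyph("b", 14.0, 5.0),
        Glyph("e", 19.2, 5.0),
    ]
    print(join_glyphs(line, font_size=10.0))
    print(join_glyphs(line, font_size=10.0, max_gap_ratio=0.5))
```

With the default ratio the sample line comes out as two words. Raising the ratio merges them into one, and lowering it far enough splits every letter apart, which is the kind of trade-off the tweaking is about.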