Walter,
Well said. (And I love the hamburger conversion analogy - very apt.)
The only thing I will add is that when you have a collection of similar
rich text documents, you might be able to construct queries to respect
internal structures within the documents. If all/most of your documents
hav
You may try to use tesseract tool to check data extraction from pdf or
images and then go forward accordingly. As far as I understand the PDF is
an image and not data. The searchable PDF actually overlays the selectable
text as hidden text over the PDF image. These PDFs can be indexed and
extracted
PDF is not a structured document format. It is a printer control format.
PDF does not have a paragraph marker. Instead, it says to move
to this spot on the page, choose this font, and print this letter. For a
paragraph, it moves farther. For the next letter in a word, it moves a
little bit. Extra
Solr will not do this automatically, the Extracting Request Handler
simply indexes the entire contents of the doc without regard to things
like paragraphs etc. Ditto with HTML. This is actually a task that
requires getting into Tika and using all the bells and whistles there.
I'd recommend two thi