Hello,

we use Solr 3.5 and Tika to index a lot of PDFs. The content of those PDFs
is searchable via a full-text search.
The indexed terms are also used to generate search suggestions.

Unfortunately, PDFBox seems to insert a space character wherever there is a
soft hyphen in the content of a PDF.
As a result, the extracted text is sometimes very fragmented. For example, the
word "Medizin" is extracted as "Me di zin".
As a consequence, the suggestions are often unusable and the search does not
work as expected.
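
To illustrate the effect, here is a minimal Python sketch. It is not our
actual Solr/Tika pipeline: the function names are hypothetical, and plain
whitespace tokenization is assumed in place of the real analyzer. It only
shows how replacing a soft hyphen (U+00AD) with a space splits one word into
several terms, while dropping it keeps the word intact.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode soft hyphen, invisible unless a line breaks there


def extract_with_space(text: str) -> str:
    """Mimic the observed behaviour: each soft hyphen becomes a space."""
    return text.replace(SOFT_HYPHEN, " ")


def extract_dropping_hyphen(text: str) -> str:
    """Desired behaviour (assumption): the soft hyphen is simply removed."""
    return text.replace(SOFT_HYPHEN, "")


word = "Me" + SOFT_HYPHEN + "di" + SOFT_HYPHEN + "zin"

# Whitespace tokenization stands in for the real index-time analysis.
print(extract_with_space(word).split())       # ['Me', 'di', 'zin']
print(extract_dropping_hyphen(word).split())  # ['Medizin']
```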

Does anyone have a suggestion for how to extract the content of PDFs containing
soft hyphens without fragmenting it?

Best
Dirk
