Hi,

Maybe the PDF creator tool is not generating "fluid" text. In a PDF, text is defined by objects, e.g. for "Medizin":

20 0 obj (Medizin) endobj

However, this can also happen:

20 0 obj (Me) endobj
21 0 obj (di) endobj
22 0 obj (zin) endobj

Note that there are now three text objects, and the extraction tool can interpret them as three words. Check your PDF file to make sure that it is well-formed.

On Fri, Feb 10, 2012 at 8:21 AM, Dirk Högemann <dirk.hoegem...@googlemail.com> wrote:

> Hello,
>
> we use Solr 3.5 and Tika to index a lot of PDFs. The content of those PDFs
> is searchable via a full-text search.
> The terms are also used to make search suggestions.
>
> Unfortunately, pdfbox seems to insert a space character when there are
> soft hyphens in the content of the PDF.
> Thus the extracted text is sometimes very fragmented. For example, the word
> Medizin is extracted as Me di zin.
> As a consequence, the suggestions are often unusable and the search does not
> work as expected.
>
> Does anyone have a suggestion for how to extract the content of PDFs containing
> soft hyphens without fragmenting it?
>
> Best
> Dirk
>

-- 
[ ]'s
Shairon Toledo
http://www.google.com/profiles/shairon.toledo
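
If the PDF itself cannot be regenerated, one possible workaround is to clean up the extracted text before it is indexed. The following is only a minimal sketch in Python, not something provided by pdfbox, Tika, or Solr. It assumes that the extracted text still contains the soft hyphen character (U+00AD) followed by the inserted space, which should be checked against real output first. The function name join_soft_hyphenated is made up for this illustration.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Assumption: the extractor kept the soft hyphen and added whitespace after it,
# so "Medizin" arrives as "Me\u00ad di\u00ad zin".
_SOFT_HYPHEN_BREAK = re.compile(SOFT_HYPHEN + r"\s*")


def join_soft_hyphenated(text: str) -> str:
    """Remove each soft hyphen together with any whitespace that follows it."""
    return _SOFT_HYPHEN_BREAK.sub("", text)
```

If the extractor drops the soft hyphen and leaves only the space, this approach cannot tell a fragment boundary from a real word boundary, and the fix would have to happen during extraction instead.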