Hi,
Maybe the PDF creator tool is not generating "fluid" text. A PDF has
sections defined by objects, e.g. for "Medizin":

20 0 obj
(Medizin)
endobj

However, this can also happen:

20 0 obj
(Me)
endobj

21 0 obj
(di)
endobj

22 0 obj
(zin)
endobj

Note that there are now 3 text objects; the extraction tool can interpret
them as 3 words.
Check your PDF file to make sure that it's well-formed.
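
To illustrate why split objects can come out as separate words, here is a
rough Python sketch. It is not pdfbox's or Tika's actual algorithm; the
Fragment class, the join_fragments function, the coordinates and the
gap_threshold value are all made up for this example. It compares joining
every fragment with a space against inserting a space only where there is a
visible horizontal gap between fragments.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One piece of text with a position (hypothetical page units)."""
    text: str
    x: float      # left edge of the fragment
    width: float  # rendered width of the fragment


def join_fragments(fragments, gap_threshold=1.0):
    """Join fragments on one line, adding a space only across real gaps.

    gap_threshold is an assumed tuning value, not taken from any library.
    """
    parts = []
    prev_end = None
    for frag in sorted(fragments, key=lambda f: f.x):
        # Insert a space only if this fragment starts clearly after
        # the previous one ended.
        if prev_end is not None and frag.x - prev_end > gap_threshold:
            parts.append(" ")
        parts.append(frag.text)
        prev_end = frag.x + frag.width
    return "".join(parts)


# "Medizin" stored as three adjacent fragments, as in the example above.
fragments = [
    Fragment("Me", 10.0, 12.0),
    Fragment("di", 22.0, 8.0),
    Fragment("zin", 30.0, 14.0),
]

naive = " ".join(f.text for f in fragments)  # treats every fragment as a word
merged = join_fragments(fragments)           # looks at the gaps instead

print(naive)
print(merged)
```

If the fragments really touch each other on the page, the gap-based join
gives back the whole word, while the naive join gives the fragmented result
described in the original message.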



On Fri, Feb 10, 2012 at 8:21 AM, Dirk Högemann <
dirk.hoegem...@googlemail.com> wrote:

> Hello,
>
> we use Solr 3.5 and Tika to index a lot of PDFs. The content of those PDFs
> is searchable via a full-text search.
> Also the terms are used to make search suggestions.
>
> Unfortunately, pdfbox seems to insert a space character when there are
> soft-hyphens in the content of the PDF.
> Thus the extracted text is sometimes very fragmented. For example the word
> Medizin is extracted as Me di zin.
> As a consequence the suggestions are often unusable and the search does not
> work as expected.
>
> Does anyone have a suggestion for how to extract the content of PDFs
> containing soft-hyphens without fragmenting it?
>
> Best
> Dirk
>



-- 
[ ]'s
Shairon Toledo
http://www.google.com/profiles/shairon.toledo
