I think you need to control the parameter "enableAutoSpace" in PDFBox. There's a JIRA issue for it, but it depends on some Tika 1.1 changes, as far as I can understand:
https://issues.apache.org/jira/browse/SOLR-2930

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 10 Feb 2012, at 11:21, Dirk Högemann wrote:

> Hello,
>
> we use Solr 3.5 and Tika to index a lot of PDFs. The content of those PDFs
> is searchable via a full-text search.
> Also the terms are used to make search suggestions.
>
> Unfortunately pdfbox seems to insert a space character when there are
> soft hyphens in the content of the PDF.
> Thus the extracted text is sometimes very fragmented. For example the word
> Medizin is extracted as Me di zin.
> As a consequence the suggestions are often unusable and the search does not
> work as expected.
>
> Has anyone a suggestion how to extract the content of PDFs containing
> soft hyphens without fragmenting it?
>
> Best
> Dirk
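
Until that setting can be controlled from Solr, a possible stop-gap is to clean the extracted text before it is indexed. The sketch below is only an illustration and not part of Solr, Tika, or PDFBox: the class name SoftHyphenCleaner and the method clean are hypothetical, and it assumes the soft hyphen character (U+00AD) is still present in the extracted text in front of the inserted space. If the extractor drops the soft hyphen and leaves only the space, this approach will not help.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr, Tika, or PDFBox.
public class SoftHyphenCleaner {

    // Matches a soft hyphen (U+00AD) plus any whitespace directly after it.
    // Assumption: the extractor keeps the soft hyphen in its output.
    private static final Pattern SOFT_HYPHEN_GAP = Pattern.compile("\u00AD\\s*");

    // Removes each soft hyphen and the whitespace that follows it,
    // so the fragments of a hyphenated word are joined again.
    static String clean(String extracted) {
        return SOFT_HYPHEN_GAP.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        // "Me", "di", "zin" separated by soft hyphen + space
        System.out.println(clean("Me\u00AD di\u00AD zin")); // prints "Medizin"
    }
}
```

Note that this also joins two separate words if a soft hyphen happens to sit at the very end of one of them, so it is worth checking against a sample of the real PDFs first.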