Thanks so far. I will have a closer look at the PDF.

I tried the enableAutoSpace setting with PDFBox 1.6, but it did not work:

PDFParser parser = new PDFParser();
parser.setEnableAutoSpace(false);
ContentHandler handler = new BodyContentHandler();

Output:
Va ri an te Creutz feldt-
Ja kob-Krank heit
Stel lung nah men des Ar beits krei ses Blut

This fragmentation makes our suggest component and parts of our search hard
to use. Any other ideas?
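
One possible post-processing workaround, as a minimal sketch only: it assumes
the extracted text still contains the soft hyphen character (U+00AD) in front
of each inserted space, which the output above does not confirm. The class and
method names are hypothetical and not part of Tika or PDFBox.

```java
public class SoftHyphenCleaner {
    // U+00AD SOFT HYPHEN, usually invisible when printed
    private static final char SOFT_HYPHEN = '\u00AD';

    /** Counts the soft hyphens still present in the extracted text. */
    public static int countSoftHyphens(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == SOFT_HYPHEN) {
                count++;
            }
        }
        return count;
    }

    /** Removes each soft hyphen together with one directly following space, if any. */
    public static String removeSoftHyphenBreaks(String text) {
        return text.replaceAll("\u00AD ?", "");
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted text: soft hyphen, then the inserted space
        String extracted = "Me\u00AD di\u00AD zin";
        System.out.println(countSoftHyphens(extracted));       // prints 2
        System.out.println(removeSoftHyphenBreaks(extracted)); // prints Medizin
    }
}
```

If countSoftHyphens returns 0 for the real extracted text, the soft hyphens
are already gone and this approach cannot tell inserted spaces from real ones.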

Best
Dirk


2012/2/10 Jan Høydahl <jan....@cominvent.com>

> I think you need to control the parameter "enableAutoSpace" in PDFBox.
> There's a JIRA issue for it, but as far as I understand it depends on some
> Tika 1.1 stuff
>
> https://issues.apache.org/jira/browse/SOLR-2930
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 10. feb. 2012, at 11:21, Dirk Högemann wrote:
>
> > Hello,
> >
> > We use Solr 3.5 and Tika to index a lot of PDFs. The content of those PDFs
> > is searchable via a full-text search.
> > The terms are also used to make search suggestions.
> >
> > Unfortunately, PDFBox seems to insert a space character wherever there is
> > a soft hyphen in the content of the PDF.
> > Thus the extracted text is sometimes very fragmented. For example, the word
> > Medizin is extracted as Me di zin.
> > As a consequence, the suggestions are often unusable and the search does not
> > work as expected.
> >
> > Does anyone have a suggestion for how to extract the content of PDFs
> > containing soft hyphens without fragmenting it?
> >
> > Best
> > Dirk
>
>
