Hello,

You really want to fix this before indexing, so that you don't index garbage. One way to do that is to use a dictionary while looking at two tokens at a time (current + next). You might then see that neither "fo" nor "cus" is in the dictionary but that "focus" is, so you concatenate the two tokens and output a single "focus" token. You'd do something similar with "fo-" and "cus".
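
To make the idea concrete, here is a rough sketch in plain Python rather than as a Lucene TokenFilter. The function name merge_split_tokens, the sample token list, and the tiny dictionary are all made up for illustration; a real setup would need a proper Dutch word list and would run inside the analysis chain.

```python
def merge_split_tokens(tokens, dictionary):
    """Rejoin words that were split in two, using a dictionary lookup.

    Hypothetical helper: looks at the current and next token and merges
    them when neither half is a known word but their concatenation is.
    """
    merged = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Drop a trailing hard (U+002D) or soft (U+00AD) hyphen, so that
            # "fo-" + "cus" is handled the same way as "fo" + "cus".
            stem = current.rstrip("-\u00ad")
            candidate = stem + nxt
            if (stem.lower() not in dictionary
                    and nxt.lower() not in dictionary
                    and candidate.lower() in dictionary):
                merged.append(candidate)
                i += 2  # both tokens were consumed
                continue
        merged.append(current)
        i += 1
    return merged


# Toy dictionary and token stream, assumed for the example only.
dutch_words = {"de", "focus", "ligt", "op"}
tokens = ["de", "fo", "cus", "ligt", "op", "fo-", "cus"]
print(merge_split_tokens(tokens, dutch_words))
```

Requiring that neither half is a dictionary word keeps the sketch from gluing together two ordinary adjacent words that happen to form a third one; whether that rule is strict enough for OCR'd Dutch text is something you would have to check against your own data.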

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Bauke Jan Douma <bjdo...@xs4all.nl>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, May 26, 2009 4:42:39 PM
> Subject: How to deal with hyphens in PDF documents?
>
> Good day, fellow solr users,
>
> Fair warning:
> -------------
> I am utterly new to solr and to this mailing list (and to lucene for that
> matter). I have been playing with solr for about two weeks.
>
>
> Goal:
> -----
> I would like to index several thousand OCR'd newspaper articles, stored as
> PDF documents. I have also been fiddling with PDFBox (tika), and with
> pdftotext in that regard.
> Ultimately, I would like to present search results having a URL to the
> original PDF, which when clicked, opens up the PDF with the search terms
> highlighted.
>
>
> Problem: hyphens (using PDFBox):
> --------------------------------
> Said newspaper articles are in Dutch. Now that language has the peculiarity
> that hyphenated words at EOL are a very common occurrence.
>
> The OCR'ed PDFs contain both soft and hard hyphens. Let's take the word
> 'focus' for example (focus in English), which is hyphenated as 'fo - cus',
> neither part of which is a Dutch word, by the way.
>
> Currently, in the XML search-results, using tika PDFBox, this can occur as:
>
>     fo- cus   (when the original PDF has a hard hyphen here, U+002D)
>     fo cus    (when the original PDF has a soft hyphen here, U+00AD)
>
> The problem is that neither of these would be found with a search term of
> 'focus'.
> I've been googling for this for the past few days, but haven't seen this
> issue addressed anywhere. I must be overlooking something very obvious.
>
>
> Alternative? (using pdftotext):
> -------------------------------
> I was thinking of an alternative: using pdftotext to extract the content,
> run it through some custom filter to unhyphenate hyphenated words, and index
> these separately, besides the indexed original text. That way a search for
> those terms would yield results.
>
> With my limited knowledge and experience with solr however, presently I see
> that as shifting the same problem more or less, namely to where I want to
> present a clickable URL into the original PDF, with a search-string obtained
> from the solr search results (to highlight the term in the PDF).
>
>
> Any thoughts or pointers would be appreciated.
> Thanks all in advance for your time.
>
> Regards,
> Bauke Jan Douma