Re: How to deal with hyphens in PDF documents?

Peter Kiraly Wed, 27 May 2009 00:45:49 -0700

Hi,

My solution to this problem in Lucene was to modify Lucene's parser.
Lucene had a non-Java file (StandardTokenizer.jj) that defines what a
token is and the types of tokens. My rule was that a soft or hard
hyphen at the end of a line denotes a word that continues at the
beginning of the next line. I used iText instead of PDFBox, because
PDFBox was ignoring hyphens at the end of the line. That was years
ago. The file is now called StandardTokenizerImpl.jflex. I don't know
how to solve it in Solr, because it incorporates several Lucene jars,
and it is not clear to me how to hack only one jar.
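
To illustrate the rule outside the tokenizer grammar, here is a minimal
sketch in Python. It is not the JFlex change itself; the function name
is made up for this example, and it assumes the extracted text still
has its line breaks and end-of-line hyphens (U+002D or U+00AD).

```python
import re

# A hard (U+002D) or soft (U+00AD) hyphen that directly follows a word
# character, sits at the end of a line, and is followed by a word
# character at the start of the next line.
_EOL_HYPHEN = re.compile(r"(?<=\w)[-\u00ad][ \t]*\r?\n[ \t]*(?=\w)")


def join_hyphenated_lines(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    return _EOL_HYPHEN.sub("", text)
```

Note that this also joins real compounds that happen to break at the
hyphen ("well-" + "known" becomes "wellknown"), so it is only a rough
approximation.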


Király Péter
http://extensiblecatalog.org
http://tesuji.eu

----- Original Message -----
From: "Bauke Jan Douma" <bjdo...@xs4all.nl>
To: <solr-user@lucene.apache.org>
Sent: Wednesday, May 27, 2009 1:55 AM
Subject: Re: How to deal with hyphens in PDF documents?

Otis Gospodnetic wrote on 05/26/2009 11:06 PM:
Hello,
You really want to fix this before indexing, so you don't index
garbage. One way to fix this is to make use of dictionaries while
looking at two tokens at a time (current + next). Then you might see
that neither "fo" nor "cus" is in the dictionary, but that "focus" is,
so you might concatenate the tokens and output just one "focus" token.
You'd do something similar with "fo-" and "cus".
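
A minimal sketch of that two-token lookup, in Python rather than as a
Lucene filter. The function name and the tiny dictionary in the usage
note are made up for illustration; a real setup would need a proper
Dutch word list.

```python
from typing import Iterable, List, Set


def merge_split_tokens(tokens: Iterable[str], dictionary: Set[str]) -> List[str]:
    """Merge (current, next) token pairs when only the joined form is a word."""
    toks = list(tokens)
    out: List[str] = []
    i = 0
    while i < len(toks):
        current = toks[i]
        if i + 1 < len(toks):
            nxt = toks[i + 1]
            # Drop a trailing hard or soft hyphen, so "fo-" is treated like "fo".
            stem = current.rstrip("-\u00ad")
            candidate = stem + nxt
            if (candidate.lower() in dictionary
                    and stem.lower() not in dictionary
                    and nxt.lower() not in dictionary):
                out.append(candidate)
                i += 2  # both halves are consumed
                continue
        out.append(current)
        i += 1
    return out
```

With a dictionary of {"de", "focus", "ligt"}, the tokens
["de", "fo-", "cus", "ligt"] come out as ["de", "focus", "ligt"].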
 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Bauke Jan Douma <bjdo...@xs4all.nl>
To: solr-user@lucene.apache.org
Sent: Tuesday, May 26, 2009 4:42:39 PM
Subject: How to deal with hyphens in PDF documents?

Good day, fellow solr users,

Fair warning:
-------------
I am utterly new to solr and to this mailing list (and to lucene for
that matter).
I have been playing with solr for about two weeks.


Goal:
-----
I would like to index several thousand OCR'd newspaper articles, stored
as PDF documents. I have also been fiddling with PDFBox (tika), and
with pdftotext in that regard.
Ultimately, I would like to present search results having a URL to the
original PDF, which when clicked, opens up the PDF with the search
terms highlighted.


Problem: hyphens (using PDFBox):
--------------------------------
Said newspaper articles are in Dutch. Now that language has the
peculiarity that hyphenated words at EOL are a very common occurrence.
The OCR'ed PDFs contain both soft and hard hyphens. Let's take the word
'focus' for example (focus in English), which is hyphenated as
'fo - cus', neither part of which is a Dutch word, by the way.
Currently, in the XML search results, using tika PDFBox, this can occur
as:
    fo- cus (when the original PDF has a hard hyphen here, U+002D)
    fo cus  (when the original PDF has a soft hyphen here, U+00AD)
The problem is that neither of these would be found with a search term
of 'focus'. I've been googling for this for the past few days, but
haven't seen this issue addressed anywhere. I must be overlooking
something very obvious.


Alternative? (using pdftotext):
-------------------------------
I was thinking of an alternative: using pdftotext to extract the
content, run it through some custom filter to unhyphenate hyphenated
words, and index these separately, besides the indexed original text.
That way a search for those terms would yield results.
With my limited knowledge and experience with solr however, presently I
see that as shifting the same problem more or less, namely to where I
want to present a clickable URL into the original PDF, with a
search-string obtained from the solr search results (to highlight the
term in the PDF).
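
One way to picture such a filter is the rough Python sketch below. It
returns the unhyphenated text for indexing, together with a map from
each joined word back to the fragment as it was extracted, which could
serve as the search-string for the PDF. The function name is invented
for this example, and it assumes the extracted text keeps its line
breaks; whether a PDF viewer can actually find a fragment that spans
two lines is a separate question.

```python
import re
from typing import Dict, Tuple

# Two word fragments separated by a hard (U+002D) or soft (U+00AD)
# hyphen at the end of a line.
_SPLIT_WORD = re.compile(r"(\w+)[-\u00ad][ \t]*\r?\n[ \t]*(\w+)")


def dehyphenate_with_map(text: str) -> Tuple[str, Dict[str, str]]:
    """Return the joined text and a map of joined word -> original fragment."""
    surface_forms: Dict[str, str] = {}

    def _join(match: "re.Match[str]") -> str:
        joined = match.group(1) + match.group(2)
        # If a word is split more than once, the last fragment seen wins.
        surface_forms[joined] = match.group(0)
        return joined

    return _SPLIT_WORD.sub(_join, text), surface_forms
```

The first return value would go into the extra, unhyphenated field; the
original text stays in its own field as before.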


Any thoughts or pointers would be appreciated.
Thanks all in advance for your time.

Regards,
Bauke Jan Douma

Hello Otis,

Understood. But wouldn't that lead to the problem that, when using the
search result (taking it from the highlighting result in solr -- forgot
to mention), that fragment will not be found in the PDF, since the PDF
contains the hyphenated word?

Oops. Just now I discovered that searching multiple-word strings that
cross multiple lines in a PDF doesn't even work to begin with, even
when there are no hyphens (evince on Ubuntu -- don't know if that works
in Adobe Acrobat). That looks like an unsolved problem.

Thank you for your input.

Bauke Jan
