Re: Use of scanned documents for text extraction and indexing

Bastian Buch Thu, 26 Feb 2009 21:57:20 -0800

You can use Tesseract, an open-source OCR engine owned by Google. It is native C/C++ code, so to use it from Java you need either JNI or direct process creation. There is no PDF support, but you can use ImageMagick to convert those documents on the fly. The engine scans documents line by line without trying to detect text boxes, which is a problem with multi-column text. With some image preprocessing you can solve this as well.
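To illustrate the process-creation route, here is a minimal sketch in Java. It assumes that the ImageMagick `convert` and `tesseract` binaries are installed and on the PATH, that the input is a single-page PDF, and that Tesseract writes its result to `<outputBase>.txt`. The class name `OcrSketch`, its helper methods, and the 300 dpi density are my own illustrative choices, not part of either tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: shells out to ImageMagick and Tesseract.
public class OcrSketch {

    // Command line that rasterizes a PDF into a TIFF image.
    // 300 dpi is an assumed value that usually suits OCR.
    static List<String> convertCommand(Path pdf, Path tiff) {
        return List.of("convert", "-density", "300", pdf.toString(), tiff.toString());
    }

    // Command line that runs OCR; Tesseract appends ".txt" to outputBase.
    static List<String> tesseractCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    // Runs an external command and fails on a non-zero exit code.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed (" + exitCode + "): " + command);
        }
    }

    // Converts the PDF, runs OCR on the image, and returns the recognized text.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr");
        Path tiff = workDir.resolve("page.tif");
        Path outputBase = workDir.resolve("page");
        run(convertCommand(pdf, tiff));
        run(tesseractCommand(tiff, outputBase));
        byte[] bytes = Files.readAllBytes(workDir.resolve("page.txt"));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java OcrSketch <file.pdf>");
            return;
        }
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

The returned string can then be put into a Lucene or Solr field like any other text. Any preprocessing for multi-column pages (for example cropping each column into its own image) would go between the two `run` calls.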


Cheers Bastian.

http://bastian-buch.de


Renaud Waldura wrote:

There is quite a bit of literature available on this topic. This paper
presents a summary. Nothing immediately applicable I'm afraid.

Retrieving OCR Text: A survey of current approaches
Steven M. Beitzel, Eric C. Jensen, David A Grossman
Illinois Institute of Technology

It lists a number of other papers that are easy to find online. Let me know
what you find, I'm interested in this too.

--Renaud

-----Original Message-----

From: Sudarsan, Sithu D. [mailto:sithu.sudar...@fda.hhs.gov]
Sent: Thursday, February 26, 2009 8:29 AM
To: solr-user@lucene.apache.org; java-u...@lucene.apache.org
Subject: Use of scanned documents for text extraction and indexing

Hi All:

Is there any study / research on taking scanned paper documents as
images (maybe PDF), extracting the text with OCR or some other technique,
and the quality of the resulting index?

Thanks in advance,
Sithu D Sudarsan

sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu
