There is quite a bit of literature available on this topic. This paper presents a summary. Nothing immediately applicable, I'm afraid.
Retrieving OCR Text: A survey of current approaches
Steven M. Beitzel, Eric C. Jensen, David A Grossman
Illinois Institute of Technology

It lists a number of other papers that are easy to find online. Let me know what you find, I'm interested in this too.

--Renaud

-----Original Message-----
From: Sudarsan, Sithu D. [mailto:sithu.sudar...@fda.hhs.gov]
Sent: Thursday, February 26, 2009 8:29 AM
To: solr-user@lucene.apache.org; java-u...@lucene.apache.org
Subject: Use of scanned documents for text extraction and indexing

Hi All:

Is there any study / research done on using scanned paper documents as images (may be PDF), and then use some OCR or other technique for extracting text, and the resultant index quality?

Thanks in advance,
Sithu D Sudarsan

sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu