Thanks to all who have responded (Hannes, Shashi, Vikram, Bastian, Renaud and the rest).
Using OCRopus might provide the flexibility to handle multi-column and formatted documents. Regarding literature on OCR, a few follow-ups to the paper Renaud linked do exist, but I could not locate anything significant.

I'll update if I can find something useful to report.

Sincerely,
Sithu

sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu

-----Original Message-----
From: Vikram Kumar [mailto:vikrambku...@gmail.com]
Sent: Friday, February 27, 2009 5:44 AM
To: solr-user@lucene.apache.org; Shashi Kant
Subject: Re: Use of scanned documents for text extraction and indexing

Check this: http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions

> How well does it work?
> The character recognition accuracy of OCRopus right now (04/2007) is about
> like Tesseract. That's because the only character recognition plug-in in
> OCRopus is, in fact, Tesseract. In the future, there will be additional
> character recognition plug-ins, both for Latin and for other character sets.
> The big area of improvement relative to other open source OCR systems right
> now is in the area of layout analysis; in our benchmarks, OCRopus greatly
> reduces layout errors compared to other open source systems.

OCR is only part of the solution for scanned documents, i.e., OCR engines only recognize text. For structural/semantic understanding of documents, you need engines like OCRopus that can do layout analysis and provide meaningful data for document analysis and understanding.

From their own Wiki:

Should I use OCRopus or Tesseract?

> You might consider using OCRopus right now if you require layout analysis,
> if you want to contribute to it, if you find its output format more
> convenient (HTML with embedded OCR information), and/or if you anticipate
> requiring some of its other capabilities in the future (pluggability,
> multiple scripts, statistical language models, etc.).
> In terms of character error rates, OCRopus performs similar to Tesseract. In
> terms of layout analysis, OCRopus is significantly better than Tesseract.
> The main reasons not to use OCRopus yet is that it hasn't been packaged yet,
> that it has limited multi-platform support, and that it runs somewhat
> slower. We hope to address all those issues by the beta release.

On Thu, Feb 26, 2009 at 11:35 PM, Shashi Kant <shashi_k...@yahoo.com> wrote:

> Can anyone back that up?
>
> IMHO Tesseract is the state-of-the-art in OCR, but not sure that "Ocropus
> builds on Tesseract".
> Can you confirm that Vikram has a point?
>
> Shashi
>
> ----- Original Message ----
> From: Vikram Kumar <vikrambku...@gmail.com>
> To: solr-user@lucene.apache.org; Shashi Kant <sk...@sloan.mit.edu>
> Sent: Thursday, February 26, 2009 9:21:07 PM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Tesseract is pure OCR. Ocropus builds on Tesseract.
> Vikram
>
> On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant <shashi_k...@yahoo.com>
> wrote:
>
> > Another project worth investigating is Tesseract.
> >
> > http://code.google.com/p/tesseract-ocr/
> >
> > ----- Original Message ----
> > From: Hannes Carl Meyer <m...@hcmeyer.com>
> > To: solr-user@lucene.apache.org
> > Sent: Thursday, February 26, 2009 11:35:14 AM
> > Subject: Re: Use of scanned documents for text extraction and indexing
> >
> > Hi Sithu,
> >
> > there is a project called ocropus done by the DFKI, check the online demo
> > here: http://demo.iupr.org/cgi-bin/main.cgi
> >
> > And also http://sites.google.com/site/ocropus/
> >
> > Regards
> >
> > Hannes
> >
> > m...@hcmeyer.com
> > http://mimblog.de
> >
> > On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> > sithu.sudar...@fda.hhs.gov> wrote:
> >
> > > Hi All:
> > >
> > > Is there any study / research done on using scanned paper documents as
> > > images (may be PDF), and then use some OCR or other technique for
> > > extracting text, and the resultant index quality?
> > >
> > > Thanks in advance,
> > > Sithu D Sudarsan
> > >
> > > sithu.sudar...@fda.hhs.gov
> > > sdsudar...@ualr.edu
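
A note on the "character error rates" mentioned in the FAQ excerpt above, since it bears on the index-quality question: the figure is commonly computed as the edit distance between the OCR output and a hand-checked transcription, divided by the length of the transcription. The sketch below is only an illustration under that assumption. The function names and sample strings are made up for this example and do not come from OCRopus, Tesseract, or Solr.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning reference
    into hypothesis."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: a hand-checked line and its OCR output.
    truth = "text extraction and indexing"
    ocr_output = "text extracti0n and indexlng"
    print(character_error_rate(truth, ocr_output))
```

Running the same measurement over a small set of scanned pages with each engine would give a rough basis for comparing them before indexing. Note that it says nothing about layout errors, which the FAQ treats as a separate issue.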