RE: Use of scanned documents for text extraction and indexing

Sudarsan, Sithu D. Fri, 27 Feb 2009 06:46:25 -0800

 

Thanks to all who have responded (Hannes, Shashi, Vikram, Bastian,
Renaud and the rest).


Using OCRopus might provide the flexibility to handle multi-column
and formatted documents.

Regarding literature on OCR, a few follow-ups to the paper Renaud linked
do exist, but I could not locate anything significant.

I'll update if I can find something useful to report.



Sincerely,
Sithu 
sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu

-----Original Message-----
From: Vikram Kumar [mailto:vikrambku...@gmail.com] 
Sent: Friday, February 27, 2009 5:44 AM
To: solr-user@lucene.apache.org; Shashi Kant
Subject: Re: Use of scanned documents for text extraction and indexing

Check this:
http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions

> How well does it work?
>
> The character recognition accuracy of OCRopus right now (04/2007) is about
> like Tesseract. That's because the only character recognition plug-in in
> OCRopus is, in fact, Tesseract. In the future, there will be additional
> character recognition plug-ins, both for Latin and for other character sets.
>
> The big area of improvement relative to other open source OCR systems right
> now is in the area of layout analysis; in our benchmarks, OCRopus greatly
> reduces layout errors compared to other open source systems.
>
OCR is only part of the solution for scanned documents, i.e., OCR
engines only recognize text.

For structural/semantic understanding of documents, you need engines like
OCRopus that can do layout analysis and provide meaningful data for
document analysis and understanding.

From their own Wiki:

> Should I use OCRopus or Tesseract?
>
> You might consider using OCRopus right now if you require layout analysis,
> if you want to contribute to it, if you find its output format more
> convenient (HTML with embedded OCR information), and/or if you anticipate
> requiring some of its other capabilities in the future (pluggability,
> multiple scripts, statistical language models, etc.).
>
> In terms of character error rates, OCRopus performs similar to Tesseract. In
> terms of layout analysis, OCRopus is significantly better than Tesseract.
>
> The main reasons not to use OCRopus yet is that it hasn't been packaged yet,
> that it has limited multi-platform support, and that it runs somewhat
> slower. We hope to address all those issues by the beta release.
>
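
To illustrate why the HTML output format mentioned in the FAQ excerpt above
can be convenient for indexing, here is a minimal Python sketch. It is not
taken from the OCRopus documentation. It assumes the output follows the hOCR
convention of marking each recognized line with the class "ocr_line"; the
names HocrLineExtractor and hocr_to_text and the sample markup are
hypothetical and only serve the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HocrLineExtractor(HTMLParser):
    """Collect the text of each element whose class list contains 'ocr_line'.

    Assumption: the OCR engine emits hOCR-style markup. Layout details such
    as bounding boxes (the 'title' attribute) are ignored here.
    """

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines, in document order
        self._depth = 0    # nesting depth inside the current ocr_line
        self._buffer = []  # text fragments of the line being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1          # nested element inside a line
        elif "ocr_line" in classes:
            self._depth = 1           # a new line starts here
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:          # the ocr_line element just closed
            text = " ".join("".join(self._buffer).split())
            if text:
                self.lines.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def hocr_to_text(hocr_html):
    """Return the recognized text, one OCR line per output line."""
    parser = HocrLineExtractor()
    parser.feed(hocr_html)
    parser.close()
    return "\n".join(parser.lines)


# Hypothetical sample markup in the hOCR style.
sample = """
<div class='ocr_page'>
  <p class='ocr_par'>
    <span class='ocr_line' title='bbox 10 10 200 30'>Scanned documents</span>
    <span class='ocr_line' title='bbox 10 40 200 60'>for <b>text</b> extraction</span>
  </p>
</div>
"""

print(hocr_to_text(sample))
```

The resulting plain text could then be sent to the index as an ordinary text
field, while the markup itself stays available if layout information is
needed later.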


On Thu, Feb 26, 2009 at 11:35 PM, Shashi Kant <shashi_k...@yahoo.com> wrote:

> Can anyone back that up?
>
> IMHO Tesseract is the state-of-the-art in OCR, but not sure that
> "Ocropus builds on Tesseract".
> Can you confirm that Vikram has a point?
>
> Shashi
>
>
>
>
> ----- Original Message ----
> From: Vikram Kumar <vikrambku...@gmail.com>
> To: solr-user@lucene.apache.org; Shashi Kant <sk...@sloan.mit.edu>
> Sent: Thursday, February 26, 2009 9:21:07 PM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Tesseract is pure OCR. Ocropus builds on Tesseract.
> Vikram
>
> On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant <shashi_k...@yahoo.com>
> wrote:
>
> > Another project worth investigating is Tesseract.
> >
> > http://code.google.com/p/tesseract-ocr/
> >
> >
> >
> >
> > ----- Original Message ----
> > From: Hannes Carl Meyer <m...@hcmeyer.com>
> > To: solr-user@lucene.apache.org
> > Sent: Thursday, February 26, 2009 11:35:14 AM
> > Subject: Re: Use of scanned documents for text extraction and indexing
> >
> > Hi Sithu,
> >
> > there is a project called ocropus done by the DFKI, check the online
> > demo here: http://demo.iupr.org/cgi-bin/main.cgi
> >
> > And also http://sites.google.com/site/ocropus/
> >
> > Regards
> >
> > Hannes
> >
> > m...@hcmeyer.com
> > http://mimblog.de
> >
> > On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> > sithu.sudar...@fda.hhs.gov> wrote:
> >
> > >
> > > Hi All:
> > >
> > > Is there any study / research done on using scanned paper documents
> > > as images (maybe PDF), then using some OCR or other technique for
> > > extracting text, and on the resultant index quality?
> > >
> > >
> > > Thanks in advance,
> > > Sithu D Sudarsan
> > >
> > > sithu.sudar...@fda.hhs.gov
> > > sdsudar...@ualr.edu
> > >
> > >
> > >
> >
> >
>
>
