Re: Use of scanned documents for text extraction and indexing

Vikram Kumar Fri, 27 Feb 2009 02:44:06 -0800

Check this: http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions


> How well does it work?
>
> The character recognition accuracy of OCRopus right now (04/2007) is about
> like Tesseract. That's because the only character recognition plug-in in
> OCRopus is, in fact, Tesseract. In the future, there will be additional
> character recognition plug-ins, both for Latin and for other character sets.
>
> The big area of improvement relative to other open source OCR systems right
> now is in the area of layout analysis; in our benchmarks, OCRopus greatly
> reduces layout errors compared to other open source systems.
>
OCR is only part of the solution for scanned documents, i.e., OCR engines
only recognize text.

For structural/semantic understanding of documents, you need engines like
OCRopus that can do layout analysis and provide meaningful data for document
analysis and understanding.

From their own Wiki:

> Should I use OCRopus or Tesseract?
>
> You might consider using OCRopus right now if you require layout analysis,
> if you want to contribute to it, if you find its output format more
> convenient (HTML with embedded OCR information), and/or if you anticipate
> requiring some of its other capabilities in the future (pluggability,
> multiple scripts, statistical language models, etc.).
>
> In terms of character error rates, OCRopus performs similar to Tesseract. In
> terms of layout analysis, OCRopus is significantly better than Tesseract.
>
> The main reasons not to use OCRopus yet is that it hasn't been packaged yet,
> that it has limited multi-platform support, and that it runs somewhat
> slower. We hope to address all those issues by the beta release.
>


On Thu, Feb 26, 2009 at 11:35 PM, Shashi Kant <[email protected]> wrote:

> Can anyone back that up?
>
> IMHO Tesseract is the state-of-the-art in OCR, but not sure that "Ocropus
> builds on Tesseract".
> Can you confirm that Vikram has a point?
>
> Shashi
>
>
>
>
> ----- Original Message ----
> From: Vikram Kumar <[email protected]>
> To: [email protected]; Shashi Kant <[email protected]>
> Sent: Thursday, February 26, 2009 9:21:07 PM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Tesseract is pure OCR. Ocropus builds on Tesseract.
> Vikram
>
> On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant <[email protected]>
> wrote:
>
> > Another project worth investigating is Tesseract.
> >
> > http://code.google.com/p/tesseract-ocr/
> >
> >
> >
> >
> > ----- Original Message ----
> > From: Hannes Carl Meyer <[email protected]>
> > To: [email protected]
> > Sent: Thursday, February 26, 2009 11:35:14 AM
> > Subject: Re: Use of scanned documents for text extraction and indexing
> >
> > Hi Sithu,
> >
> > there is a project called ocropus done by the DFKI, check the online demo
> > here: http://demo.iupr.org/cgi-bin/main.cgi
> >
> > And also http://sites.google.com/site/ocropus/
> >
> > Regards
> >
> > Hannes
> >
> > [email protected]
> > http://mimblog.de
> >
> > On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> > [email protected]> wrote:
> >
> > >
> > > Hi All:
> > >
> > > Is there any study / research done on using scanned paper documents as
> > > images (may be PDF), and then use some OCR or other technique for
> > > extracting text, and the resultant index quality?
> > >
> > >
> > > Thanks in advance,
> > > Sithu D Sudarsan
> > >
> > > [email protected]
> > > [email protected]
> > >
> > >
> > >
> >
> >
>
>
