Re: Use of scanned documents for text extraction and indexing

Shashi Kant Thu, 26 Feb 2009 09:11:34 -0800

Another project worth investigating is Tesseract.

http://code.google.com/p/tesseract-ocr/





----- Original Message ----
From: Hannes Carl Meyer <m...@hcmeyer.com>
To: solr-user@lucene.apache.org
Sent: Thursday, February 26, 2009 11:35:14 AM
Subject: Re: Use of scanned documents for text extraction and indexing

Hi Sithu,

There is a project called OCRopus, developed by the DFKI. Check the online demo
here: http://demo.iupr.org/cgi-bin/main.cgi

And also http://sites.google.com/site/ocropus/

Regards

Hannes

m...@hcmeyer.com
http://mimblog.de

On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <sithu.sudar...@fda.hhs.gov> wrote:

>
> Hi All:
>
> Has there been any study or research on taking scanned paper documents as
> images (maybe PDF), using OCR or some other technique to extract the
> text, and the quality of the resulting index?
>
>
> Thanks in advance,
> Sithu D Sudarsan
>
> sithu.sudar...@fda.hhs.gov
> sdsudar...@ualr.edu
>
>
>
