RE: Use of scanned documents for text extraction and indexing

2009-02-27 Thread Sudarsan, Sithu D.
e.org; Shashi Kant Subject: Re: Use of scanned documents for text extraction and indexing Check this: http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions > How well does it work? > The character recognition accuracy of OCRopus right now (04/2007) is about like Tesseract. That'

Re: Use of scanned documents for text extraction and indexing

2009-02-27 Thread Vikram Kumar
Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant wrote: > Another project worth investigating is Tesseract. > http://code.google.com/p/tesseract-ocr/ > - Original Message > From: Hannes Carl Meyer > To: solr-user@lucene.a

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Bastian Buch
terested in this too. --Renaud -Original Message- From: Sudarsan, Sithu D. [mailto:sithu.sudar...@fda.hhs.gov] Sent: Thursday, February 26, 2009 8:29 AM To: solr-user@lucene.apache.org; java-u...@lucene.apache.org Subject: Use of scanned documents for text extraction and indexing Hi All:

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
ebruary 26, 2009 9:21:07 PM Subject: Re: Use of scanned documents for text extraction and indexing Tesseract is pure OCR. Ocropus builds on Tesseract. Vikram On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant wrote: > Another project worth investigating is Tesseract. > > http://code.google.

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Vikram Kumar
Carl Meyer > To: solr-user@lucene.apache.org > Sent: Thursday, February 26, 2009 11:35:14 AM > Subject: Re: Use of scanned documents for text extraction and indexing > > Hi Sithu, > > there is a project called ocropus done by the DFKI, check the online demo > here: http

RE: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Renaud Waldura
apache.org Subject: Use of scanned documents for text extraction and indexing Hi All: Has any study or research been done on using scanned paper documents as images (possibly PDF), then applying OCR or another technique to extract the text, and on the resulting index quality? Thanks in advanc

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
Another project worth investigating is Tesseract. http://code.google.com/p/tesseract-ocr/ - Original Message From: Hannes Carl Meyer To: solr-user@lucene.apache.org Sent: Thursday, February 26, 2009 11:35:14 AM Subject: Re: Use of scanned documents for text extraction and indexing

RE: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Sudarsan, Sithu D.
@lucene.apache.org Subject: Re: Use of scanned documents for text extraction and indexing Hi Sithu, there is a project called ocropus done by the DFKI, check the online demo here: http://demo.iupr.org/cgi-bin/main.cgi And also http://sites.google.com/site/ocropus/ Regards Hannes m

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Hannes Carl Meyer
Hi Sithu, there is a project called ocropus done by the DFKI, check the online demo here: http://demo.iupr.org/cgi-bin/main.cgi And also http://sites.google.com/site/ocropus/ Regards Hannes m...@hcmeyer.com http://mimblog.de On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. < sithu.sudar...

Use of scanned documents for text extraction and indexing

2009-02-26 Thread Sudarsan, Sithu D.
Hi All: Has any study or research been done on using scanned paper documents as images (possibly PDF), then applying OCR or another technique to extract the text, and on the resulting index quality? Thanks in advance, Sithu D Sudarsan sithu.sudar...@fda.hhs.gov sdsudar...@ualr.edu