RE: Use of scanned documents for text extraction and indexing

2009-02-27 Thread Sudarsan, Sithu D.
e.org; Shashi Kant Subject: Re: Use of scanned documents for text extraction and indexing Check this: http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions > How well does it work? > The character recognition accuracy of OCRopus right now (04/2007) is about > like Tesseract. That'

Re: Use of scanned documents for text extraction and indexing

2009-02-27 Thread Vikram Kumar
Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant > wrote: > > > Another project worth investigating is Tesseract. > > > > http://code.google.com/p/tesseract-ocr/ > > > > > > > > > > - Original Message > > From: Hannes Carl Meyer > > To: solr-user@lucene.a

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Bastian Buch
You can use Tesseract, an open-source OCR engine sponsored by Google. It is native C/C++ code, so to use it from Java you should go through JNI or launch it as a separate process. There is no PDF support, but you can use ImageMagick to convert those docs on the fly. The engine scans documents line by line without trying
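A minimal sketch of the "direct process creation" route described above, written in Python for brevity (from Java the same two commands could be launched with ProcessBuilder). It assumes the `convert` (ImageMagick) and `tesseract` executables are on the PATH, that ImageMagick can read PDFs on the machine (this normally needs Ghostscript), and that only the first page is wanted. The function names, the `page.tif` file name, and the 300 dpi default are illustrative choices, not anything from the thread.

```python
import shutil
import subprocess
from pathlib import Path


def convert_command(pdf_path, tiff_path, dpi=300):
    """Build the ImageMagick command that rasterizes page 1 of a PDF to TIFF.

    The "[0]" suffix selects the first page; 300 dpi is an assumed default.
    """
    return ["convert", "-density", str(dpi), f"{pdf_path}[0]", str(tiff_path)]


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the Tesseract command; Tesseract writes <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng"):
    """Convert one PDF page to TIFF, OCR it, and return the recognized text."""
    for tool in ("convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"required executable not found on PATH: {tool}")

    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    tiff_path = work_dir / "page.tif"   # hypothetical intermediate file name
    output_base = work_dir / "page"     # Tesseract appends ".txt" itself

    # check=True raises CalledProcessError if either tool exits non-zero.
    subprocess.run(convert_command(pdf_path, tiff_path), check=True)
    subprocess.run(tesseract_command(tiff_path, output_base, lang), check=True)

    text_path = work_dir / "page.txt"
    return text_path.read_text(encoding="utf-8", errors="replace")
```

Calling `ocr_pdf("scan.pdf", "ocr_tmp")` would return the extracted text as a string, which could then be sent to the index as an ordinary text field. Spawning a process per document keeps the native code out of the JVM or interpreter, at the cost of process start-up overhead for each file.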

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
ebruary 26, 2009 9:21:07 PM Subject: Re: Use of scanned documents for text extraction and indexing Tesseract is pure OCR. Ocropus builds on Tesseract. Vikram On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant wrote: > Another project worth investigating is Tesseract. > > http://code.google.

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Vikram Kumar
Carl Meyer > To: solr-user@lucene.apache.org > Sent: Thursday, February 26, 2009 11:35:14 AM > Subject: Re: Use of scanned documents for text extraction and indexing > > Hi Sithu, > > there is a project called ocropus done by the DFKI, check the online demo > here: http

RE: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Renaud Waldura
There is quite a bit of literature available on this topic. This paper presents a summary. Nothing immediately applicable, I'm afraid. Retrieving OCR Text: A survey of current approaches Steven M. Beitzel, Eric C. Jensen, David A. Grossman Illinois Institute of Technology It lists a number of othe

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
Another project worth investigating is Tesseract. http://code.google.com/p/tesseract-ocr/ - Original Message From: Hannes Carl Meyer To: solr-user@lucene.apache.org Sent: Thursday, February 26, 2009 11:35:14 AM Subject: Re: Use of scanned documents for text extraction and indexing

RE: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Sudarsan, Sithu D.
@lucene.apache.org Subject: Re: Use of scanned documents for text extraction and indexing Hi Sithu, there is a project called ocropus done by the DFKI, check the online demo here: http://demo.iupr.org/cgi-bin/main.cgi And also http://sites.google.com/site/ocropus/ Regards Hannes m

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Hannes Carl Meyer
Hi Sithu, there is a project called ocropus done by the DFKI, check the online demo here: http://demo.iupr.org/cgi-bin/main.cgi And also http://sites.google.com/site/ocropus/ Regards Hannes m...@hcmeyer.com http://mimblog.de On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. < sithu.sudar...