Re: OCR - Saving multi-term position

Charlie Hull Thu, 03 Jul 2014 00:44:27 -0700

On 02/07/2014 15:19, Manuel Le Normand wrote:

Hello,
Many of our indexed documents are scanned and OCR'ed documents.
Unfortunately, for various reasons, we have not been able to improve the
OCR quality much (word accuracy is below 80%), which badly hurts
retrieval quality.


As we use an open-source OCR, we are thinking of expanding every scanned
term into its most likely variations to get a higher level of confidence.

Is there any analyser that supports this kind of need, or should I make up a
syntax and analyser of my own, e.g. using the payload syntax?

The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4

Thanks,
Manuel

Hi Manuel,

We've done something like this for several of our media monitoring clients. The OCR system they use (ABBYY FineReader I think, it's pretty much an industry standard) has well-known error statistics - we know the top N things it gets wrong, e.g. scanning 'm' as two 'n's - so we can implement a kind of fuzzy search without introducing too many extra terms.
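
To make the idea concrete, here is a minimal Python sketch of expanding terms from a small table of known OCR confusions and emitting the term|position form from Manuel's example. The table contents, the function names and the one-substitution-per-variant rule are all assumptions for illustration; this is not the code we run for clients, and the real confusion statistics would come from measuring the OCR engine in question.

```python
# Hypothetical confusion table: each key is a character sequence and each
# value lists sequences it is assumed to be confused with by the OCR engine.
CONFUSIONS = {
    "h": ["ln"],
    "c": ["o"],
    "n": ["m"],
    "m": ["nn"],
}


def variants(term, confusions=CONFUSIONS, max_variants=8):
    """Return the term followed by variants that each apply one confusion at one site."""
    out = [term]
    for src, replacements in confusions.items():
        start = term.find(src)
        while start != -1:
            for rep in replacements:
                candidate = term[:start] + rep + term[start + len(src):]
                if candidate not in out:
                    out.append(candidate)
            start = term.find(src, start + 1)
    # Cap the list so long terms do not flood the index with extra tokens.
    return out[:max_variants]


def expand(text):
    """Emit 'term|position' tokens; variants share the position of their source term."""
    tokens = []
    for position, term in enumerate(text.split(), start=1):
        for variant in variants(term):
            tokens.append(f"{variant}|{position}")
    return " ".join(tokens)


print(expand("The quick brown fox"))
# The|1 Tlne|1 quick|2 quiok|2 brown|3 browm|3 fox|4
```

Keeping the table to the top N confusions and applying only one substitution per variant is what keeps the number of extra terms small. The same table could be applied to query terms instead of indexed text, depending on where you prefer to pay the cost.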

It isn't quite that simple as we're doing a lot of reverse searching ('which queries match this document') but the approach is certainly sound. The following talk from Lucene Revolution is about this kind of thing: http://www.youtube.com/watch?v=rmRCsrJp2A8


Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
