On 02/07/2014 15:19, Manuel Le Normand wrote:
Hello,
Many of our indexed documents are scanned and OCR'ed documents.
Unfortunately, for various reasons, we have not been able to improve the
OCR quality much (word accuracy is below 80%), which badly hurts
retrieval quality.

As we use an open-source OCR engine, we are thinking of expanding every
scanned term in its output into its most likely variations to get a higher
level of confidence.

Is there any analyser that supports this kind of need, or should I make up a
syntax and analyser of my own, e.g. something like the payload syntax?

The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
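
To make the proposed syntax concrete, here is a minimal sketch (plain Python, not an actual Lucene/Solr analyser) of how such a line could be turned into position-stacked tokens. It assumes every token has the form term|position and that positions never decrease; the function name parse_variants is hypothetical. Variants that share a position get a position increment of 0, which is how an index would normally represent alternatives for the same word.

```python
def parse_variants(text):
    """Parse 'term|position' tokens into (term, position, increment) triples.

    The increment is the distance from the previous token's position, so
    alternative readings of the same word come out with increment 0.
    """
    tokens = []
    last_pos = 0
    for raw in text.split():
        # Split on the last '|' so a term containing '|' is still handled.
        term, _, pos = raw.rpartition("|")
        pos = int(pos)
        tokens.append((term, pos, pos - last_pos))
        last_pos = pos
    return tokens


line = "The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4"
tokens = parse_variants(line)
for term, pos, inc in tokens:
    print(term, pos, inc)
```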

Thanks,
Manuel

Hi Manuel,

We've done something like this for several of our media monitoring clients. The OCR system they use (ABBYY FineReader, I think; it's pretty much an industry standard) has well-known error statistics. We know the top N things it gets wrong, e.g. scanning 'm' as two 'n's, so we can implement a kind of fuzzy search without introducing too many extra terms.
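
As a rough illustration of the idea (not our actual implementation), the sketch below expands a term using a small table of known confusions. The CONFUSIONS entries and the name ocr_variants are assumptions for the example; a real table would come from the measured error statistics of the OCR engine in use. Only one substitution is applied per variant, which keeps the number of extra terms small.

```python
# Hypothetical confusion table: (correct text, how the OCR tends to misread it).
CONFUSIONS = [("m", "nn"), ("rn", "m"), ("cl", "d")]


def ocr_variants(term, confusions=CONFUSIONS):
    """Return the term plus every variant with exactly one known misreading."""
    variants = {term}
    for correct, misread in confusions:
        start = term.find(correct)
        while start != -1:
            # Replace this single occurrence and keep the rest unchanged.
            variants.add(term[:start] + misread + term[start + len(correct):])
            start = term.find(correct, start + 1)
    return sorted(variants)


print(ocr_variants("modern"))
```

The variants could be applied either at index time (as stacked tokens) or at query time (as an OR of the alternatives); which is better depends on index size and query load.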

It isn't quite that simple in our case, as we're doing a lot of reverse searching ('which queries match this document'), but the approach is certainly sound. The following talk from Lucene Revolution is about this kind of thing: http://www.youtube.com/watch?v=rmRCsrJp2A8

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
