On 02/07/2014 15:19, Manuel Le Normand wrote:
Hello,
Many of our indexed documents are scanned and OCR'ed documents.
Unfortunately, for various reasons, we have not been able to improve the
OCR quality much (less than 80% word accuracy), which badly hurts
retrieval quality.
As we use an open
Hi Manuel,
I think OCR error correction is one of the well-known NLP tasks.
In the past I thought it could be implemented using Lucene.
This is a brief idea:
1. You have a Lucene index. This existing index is built from
correct (i.e. error-free) documents that are in the same domain as the OCR documen
: OCR - Saving multi-term position
Thanks for your answers Erick and Michael.
The term confidence level is an OCR output metric that tells, for every
word, how likely it is to be the actual scanned term. I would like the OCR
program to output all the "suspected words" that together add up to more
than ~90% confidence of containing the actual term, instead of o
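
The cumulative-confidence idea described above could be sketched roughly as follows. This is only an illustration, not the output format of any particular OCR engine: the function name select_candidates, the (word, confidence) pair layout, and the sample words are all assumptions made for the example.

```python
def select_candidates(candidates, threshold=0.9):
    """Keep the most likely readings of one scanned word until their
    confidences add up to at least `threshold`.

    `candidates` is a list of (word, confidence) pairs. This layout is a
    hypothetical one chosen for the sketch, not a real OCR output format.
    """
    # Most confident reading first.
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    selected = []
    total = 0.0
    for word, confidence in ranked:
        if total >= threshold:
            break  # the readings kept so far already cover the threshold
        selected.append(word)
        total += confidence
    return selected


# Made-up confidences for one scanned word.
readings = [("modem", 0.40), ("modern", 0.55), ("rnodem", 0.05)]
print(select_candidates(readings))
```

With these made-up numbers the two most likely readings are kept and the 5% reading is dropped, which is the trade-off between recall and the number of extra terms indexed.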
Problem here is that you wind up with a zillion unique terms in your
index, which may lead to performance issues, but you probably already
know that :).
I've seen situations where running it through a dictionary helps. That
is, does each term in the OCR match some dictionary? Problem here is
that
I don't have first hand knowledge of how you implement that, but I bet a
look at the WordDelimiterFilter would help you understand how to emit
multiple terms with the same positions pretty easily.
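
In Lucene, tokens share a position when the later ones carry a position increment of 0. The sketch below is a library-free model of that bookkeeping, not Lucene code: stack_variants and absolute_positions are hypothetical helper names, and the input is assumed to be one list of candidate readings per scanned word, each list non-empty.

```python
def stack_variants(slots):
    """Turn per-word candidate lists into (term, position_increment) pairs.

    The first variant of each word advances the position by 1; every
    further variant gets increment 0, so it lands on the same position.
    Assumes each slot holds at least one variant (an empty slot would
    simply be skipped without leaving a gap).
    """
    tokens = []
    for variants in slots:
        for i, term in enumerate(variants):
            tokens.append((term, 1 if i == 0 else 0))
    return tokens


def absolute_positions(tokens):
    """Resolve position increments into absolute positions, starting at 0."""
    position = -1
    resolved = []
    for term, increment in tokens:
        position += increment
        resolved.append((term, position))
    return resolved


# Made-up OCR output: two readings for the first word, one for the second.
tokens = stack_variants([["modern", "modem"], ["history"]])
print(absolute_positions(tokens))
```

Because both readings of the first word sit at the same position, a phrase query for either "modern history" or "modem history" would line up with the following term, at the cost of extra terms in the index.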
I've heard of this "bag of word variants" approach to indexing poor-quality
OCR output before for fin
Hello,
Many of our indexed documents are scanned and OCR'ed documents.
Unfortunately, for various reasons, we have not been able to improve the
OCR quality much (less than 80% word accuracy), which badly hurts
retrieval quality.
As we use an open-source OCR, we are thinking of changing every scanned