No offense taken.
More on this topic (opinion only): even the best OCR has a limited accuracy
rate, say 95% or 98% correct. And OCR is slow, maybe a minute per image. So it
is best to OCR into a filesystem or DB, assess the quality, then index from the DB.
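
A minimal sketch of that staged workflow in Python, in case it helps. Everything here is an assumption for illustration: the table name `ocr_result`, the helper names, the 0.9 threshold, and especially the quality heuristic (share of non-space characters that are letters or digits), which is only a crude stand-in for a real quality check. The OCR step and the Solr indexing step are left out; the OCR text is simply passed in.

```python
import sqlite3


def quality_score(text: str) -> float:
    """Hypothetical heuristic: share of non-space characters that are
    letters or digits. Noisy OCR output tends to be full of stray symbols."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the staging DB that sits between OCR and indexing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_result ("
        "image_path TEXT PRIMARY KEY, "
        "text TEXT NOT NULL, "
        "quality REAL NOT NULL)"
    )
    return conn


def store_result(conn: sqlite3.Connection, image_path: str, text: str) -> None:
    """Save one OCR result together with its quality score."""
    conn.execute(
        "INSERT OR REPLACE INTO ocr_result VALUES (?, ?, ?)",
        (image_path, text, quality_score(text)),
    )
    conn.commit()


def rows_to_index(conn: sqlite3.Connection, min_quality: float = 0.9):
    """Return (image_path, text) pairs that are good enough to index."""
    cur = conn.execute(
        "SELECT image_path, text FROM ocr_result "
        "WHERE quality >= ? ORDER BY image_path",
        (min_quality,),
    )
    return cur.fetchall()
```

The slow OCR pass then runs once, and the indexer can be rerun against the DB as often as needed. Rows below the threshold stay in the table for review or a second OCR attempt.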
Cheers -- Rick
On February 12, 2017 1:55:10
Actually, I don't know how to do it. For now I've just created a request
handler and an update processor chain for it that can detect the language
during recognition (LanguageDetect or something like that). I would really
appreciate any instructions.
Sorry if I was rude, bad english skill for good russi
Yes, you are right. I was just trying to help, and did not have time to dig out
the details. So the question is: how do you tell Solr to pass the language arg
to Tika and Tesseract?
On February 11, 2017 12:54:02 AM EST, "Игорь Абрашин"
wrote:
Hi, Rick.
I didn't mean that he needs to train anything, because Tesseract works well on
its own. The problem is that the Tika included in Solr doesn't try to use the
Russian dictionary to recognize Cyrillic text, so the result comes out using
only the English alphabet.
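
For reference, this is roughly what running Tesseract on its own with the Russian data looks like from Python. It is only a sketch: the helper names are made up, and it assumes the `tesseract` binary is on the PATH with the `rus` and `eng` traineddata installed. The `-l` option is Tesseract's language flag, and `rus+eng` asks for both languages. It does not show how Solr or Tika pass that option through.

```python
import subprocess


def tesseract_command(image_path, out_base, languages=("rus", "eng")):
    """Build the tesseract command line; '-l rus+eng' selects the languages."""
    return ["tesseract", image_path, out_base, "-l", "+".join(languages)]


def run_ocr(image_path, out_base, languages=("rus", "eng")):
    """Run tesseract and return the recognized text.

    Tesseract writes its plain-text output to '<out_base>.txt'.
    """
    subprocess.run(tesseract_command(image_path, out_base, languages), check=True)
    with open(out_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

Calling `run_ocr("scan.tif", "scan")` would then run `tesseract scan.tif scan -l rus+eng` and read back `scan.txt` (the file names are placeholders).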
On Feb 10, 2017 at 15:28, "Rick Leir"
wrote:
My guess is that you are using Tika and Tesseract. The latter is
complex, and you can start learning at
https://wiki.apache.org/tika/TikaOCR <-- shows you how to work with TIFF
The traineddata for Cyrillic is here:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
https://gith