Re: OCR image contains cyrillic characters

Rick Leir Sun, 12 Feb 2017 13:28:40 -0800

No offense taken. 

More on this topic (opinion only): even the best OCR is not fully accurate,
say 95% or 98% correct. OCR is also slow, maybe a minute per image. So it is
best to OCR into a filesystem or DB, assess the quality, then index from the DB.
Cheers -- Rick
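
A minimal sketch of that workflow in Python, under stated assumptions: the
tesseract command-line tool is on the PATH with the "rus" traineddata
installed, results are kept in SQLite, and "quality" is approximated by the
share of Cyrillic letters in the output. That score is only a stand-in that
catches the Latin-only failure described in this thread, not a real OCR
accuracy measure. The function names, table layout, and 0.8 threshold are all
hypothetical.

```python
import sqlite3
import subprocess
from typing import Callable, Iterable, List, Tuple


def tesseract_command(image_path: str, lang: str = "rus") -> List[str]:
    # "stdout" as the output base makes the tesseract CLI print the text;
    # -l selects the traineddata to use (assumed here: Russian).
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract(image_path: str, lang: str = "rus") -> str:
    # Slow step: run OCR once per image and keep the result.
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def cyrillic_ratio(text: str) -> float:
    # Assumed quality proxy: fraction of letters in the Cyrillic block.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    return cyrillic / len(letters)


def store_ocr(conn: sqlite3.Connection, image_paths: Iterable[str],
              ocr: Callable[[str], str] = run_tesseract) -> None:
    # OCR into the DB first, recording a quality score next to the text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr "
        "(path TEXT PRIMARY KEY, text TEXT, quality REAL)"
    )
    for path in image_paths:
        text = ocr(path)
        conn.execute(
            "INSERT OR REPLACE INTO ocr VALUES (?, ?, ?)",
            (path, text, cyrillic_ratio(text)),
        )
    conn.commit()


def rows_to_index(conn: sqlite3.Connection,
                  min_quality: float = 0.8) -> List[Tuple[str, str]]:
    # Only rows that pass the quality check are handed to the indexer.
    return conn.execute(
        "SELECT path, text FROM ocr WHERE quality >= ? ORDER BY path",
        (min_quality,),
    ).fetchall()
```

The rows returned by rows_to_index() would then be sent to Solr by whatever
indexing client you already use; low-scoring rows stay in the DB so they can
be reviewed or re-run without repeating the OCR for everything else.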


On February 12, 2017 1:55:10 PM EST, "Игорь Абрашин" <vjiaste...@gmail.com> 
wrote:
>Actually, I don't know how to do it((( For now I've just created a request
>handler and an update chain processor for it, with the capability to detect
>the language during the recognition process (LanguageDetect or something
>like that). I would really appreciate any instructions.
>Sorry if I was rude, bad English skills for a good Russian guy))))
>
>On 11 Feb 2017 at 19:44, "Rick Leir" <rl...@leirtech.com> wrote:
>
>> Yes, you are right. I was just trying to help, and did not have time to
>> dig out the details. So the question is: how do you tell Solr to pass the
>> language arg to Tika and Tesseract?
>>
>> On February 11, 2017 12:54:02 AM EST, "Игорь Абрашин" <
>> vjiaste...@gmail.com> wrote:
>> >Hi, Rick.
>> >I didn't mean that it needs training, because Tesseract works well
>> >separately.
>> >So, the Tika included in Solr doesn't try to use the Russian dictionary
>> >to recognize Cyrillic text, and the result comes out using only the
>> >English alphabet.
>> >
>> >On 10 Feb 2017 at 15:28, "Rick Leir" <rl...@leirtech.com> wrote:
>> >
>> >> My guess is that you are using Tika and Tesseract. The latter is
>> >> complex, and you can start learning at
>> >>
>> >> https://wiki.apache.org/tika/TikaOCR   <--shows you how to work with TIFF
>> >>
>> >> The traineddata for Cyrillic is here:
>> >>
>> >> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
>> >>
>> >> https://github.com/tesseract-ocr/tesseract/issues/147
>> >>
>> >> You likely need to enhance the images before running Tesseract.
>> >>
>> >> cheers -- Rick
>> >>
>> >> On 2017-02-10 05:03 AM, Игорь Абрашин wrote:
>> >>
>> >>> Hello, community!
>> >>> Did you manage to recognize JPG, TIFF or whatever with Cyrillic text
>> >>> inside?
>> >>> So far I've got only Latin letters in the result (it looks like ugly
>> >>> transliterated text). For images containing only Latin letters it
>> >>> works fine.
>> >>> Does anyone have any suggestions, best practices or case studies
>> >>> relating to this situation?
>> >>>
>> >>>
>> >>
>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
