Re: OCR image contains cyrillic characters

Игорь Абрашин Fri, 10 Feb 2017 21:54:26 -0800

Hi, Rick.
I didn't mean that it needs to be trained, because Tesseract works well on its own.
The problem is that the Tika bundled with Solr doesn't try to use the Russian
dictionary to recognize Cyrillic text, so the result uses only the English alphabet.
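
A minimal sketch of how the standalone check could be scripted, assuming the
tesseract command-line tool is on PATH and the Russian language pack
(rus.traineddata) is installed. The function names are hypothetical and exist
only for illustration. It changes nothing about how Tika inside Solr calls
Tesseract; it only helps confirm that the same image yields Cyrillic output
when the language is passed explicitly with "-l rus+eng".

```python
import shutil
import subprocess
import unicodedata


def build_tesseract_command(image_path, languages=("rus", "eng")):
    """Return the argv for a Tesseract run that prints recognized text to stdout."""
    # "-l rus+eng" asks Tesseract to load both language packs for this run.
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def cyrillic_ratio(text):
    """Return the fraction of alphabetic characters in text that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = [ch for ch in letters if "CYRILLIC" in unicodedata.name(ch, "")]
    return len(cyrillic) / len(letters)


def ocr(image_path, languages=("rus", "eng")):
    """Run Tesseract (if installed) and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

If cyrillic_ratio(ocr("scan.tiff")) is close to 1.0 when run standalone but
the text indexed by Solr contains only Latin letters, that points at the
language setting Tika passes to Tesseract rather than at the image or the
trained data ("scan.tiff" is a placeholder file name).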


On 10 Feb 2017 at 15:28, "Rick Leir" <rl...@leirtech.com> wrote:

> My guess is that you are using Tika and Tesseract. The latter is
> complex, and you can start learning at
>
> https://wiki.apache.org/tika/TikaOCR   <--shows you how to work with TIFF
>
> The traineddata for Cyrillic is here:
>
> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
>
> https://github.com/tesseract-ocr/tesseract/issues/147
>
> You likely need to enhance the images before running Tesseract.
>
> cheers -- Rick
>
> On 2017-02-10 05:03 AM, Игорь Абрашин wrote:
>
>> Hello, community!
>> Has anyone managed to recognize JPG, TIFF, or other images with Cyrillic
>> text inside?
>> So far I get only Latin letters in the result (it looks like ugly
>> transliterated text). For images containing only Latin letters it works fine.
>> Does anyone have suggestions, best practices, or case studies relevant to
>> this situation?
>>
>>
>
