No offense taken. More on this topic (opinion only): even the best OCR is not fully accurate, say 95% or 98% of characters correct. And OCR is slow, maybe a minute per image. So it is best to OCR into a filesystem or DB first, assess the quality of the stored text, and then index from the DB rather than running OCR at index time.
Cheers -- Rick
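A minimal sketch of that workflow, in case it helps. Everything here is an assumption for illustration, not something from the Solr or Tika configuration: the SQLite table layout, the function names, the "share of Cyrillic letters" quality heuristic, and the 0.8 threshold are all made up. The only external piece is the tesseract command line, called with its -l language option (for example rus+eng), which needs Tesseract and the Russian traineddata installed.

```python
import sqlite3
import subprocess


def run_tesseract(image_path, lang="rus+eng"):
    """Run the tesseract CLI on one image and return the recognized text.

    Assumes tesseract and the requested traineddata files are installed.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def cyrillic_ratio(text):
    """Hypothetical quality score: fraction of letters in the Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04ff")
    return cyrillic / len(letters)


def open_store(path=":memory:"):
    """Open (or create) the staging database that holds OCR output."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr ("
        "image TEXT PRIMARY KEY, text TEXT NOT NULL, quality REAL NOT NULL)"
    )
    return conn


def store_result(conn, image, text):
    """Save OCR text with its quality score; return the score."""
    quality = cyrillic_ratio(text)
    conn.execute(
        "INSERT OR REPLACE INTO ocr (image, text, quality) VALUES (?, ?, ?)",
        (image, text, quality),
    )
    conn.commit()
    return quality


def ready_for_indexing(conn, min_quality=0.8):
    """Return documents whose stored quality passes the threshold."""
    rows = conn.execute(
        "SELECT image, text FROM ocr WHERE quality >= ? ORDER BY image",
        (min_quality,),
    )
    return [{"id": image, "content": text} for image, text in rows]
```

Typical use would be store_result(conn, path, run_tesseract(path)) for each image, then sending the output of ready_for_indexing(conn) to Solr with whatever client you already use. Low-scoring rows stay in the DB so they can be reviewed or re-run after enhancing the image, without repeating the slow OCR step for the rest.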
On February 12, 2017 1:55:10 PM EST, "Игорь Абрашин" <vjiaste...@gmail.com> wrote:
>Actually, I don't know how to do it((( For now I've just created a request
>handler and an update chain processor for it, with the capability to detect
>the language during the recognition process (LanguageDetect or something
>like that). I would really appreciate any instructions.
>Sorry if I was rude, bad English skill for a good Russian guy))))
>
>On Feb 11, 2017 at 19:44, "Rick Leir" <rl...@leirtech.com> wrote:
>
>> Yes, you are right. I was just trying to help, and did not have time to
>> dig out the details. So the question is: how do you tell Solr to pass the
>> language arg to Tika and Tesseract?
>>
>> On February 11, 2017 12:54:02 AM EST, "Игорь Абрашин" <vjiaste...@gmail.com> wrote:
>> >Hi, Rick.
>> >I didn't mean that it needs to be trained, because Tesseract works well
>> >on its own.
>> >The problem is that the Tika included in Solr doesn't try to use the
>> >Russian dictionary to recognize Cyrillic text, so the result comes out
>> >using only the English alphabet.
>> >
>> >On Feb 10, 2017 at 15:28, "Rick Leir" <rl...@leirtech.com> wrote:
>> >
>> >> My guess is that you are using Tika and Tesseract. The latter is
>> >> complex, and you can start learning at
>> >>
>> >> https://wiki.apache.org/tika/TikaOCR <--shows you how to work with TIFF
>> >>
>> >> The traineddata for Cyrillic is here:
>> >>
>> >> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
>> >>
>> >> https://github.com/tesseract-ocr/tesseract/issues/147
>> >>
>> >> You likely need to enhance the images before running Tesseract.
>> >>
>> >> cheers -- Rick
>> >>
>> >> On 2017-02-10 05:03 AM, Игорь Абрашин wrote:
>> >>
>> >>> Hello, community!
>> >>> Did you manage to recognize jpg, tiff or whatever with Cyrillic text
>> >>> inside?
>> >>> I've got only Latin letters (looks like ugly translit text) in the result
>> >>> for the moment. For images containing only Latin letters it works fine.
>> >>> Does anyone have any suggestions, best practices or case studies that
>> >>> refer to this situation?
>> >>>
>> >>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.