Re: OCR image contains cyrillic characters

2017-02-12 Thread Rick Leir
No offense taken. More on this topic (opinion only): even the best OCR has a quality ratio, say 95% or 98% correct. And OCR is slow, maybe a minute per image. So it is best to OCR into a filesystem or DB, assess the quality, then index from the DB. Cheers -- Rick On February 12, 2017 1:55:10

Re: OCR image contains cyrillic characters

2017-02-12 Thread Игорь Абрашин
Actually, I don't know how to do it((( For now I've just created a request handler and an update chain processor for it, with the capability to detect the language during the recognition process (LanguageDetect or something like that). I would really appreciate any instructions. Sorry if I was rude, bad English skill for good russi

Re: OCR image contains cyrillic characters

2017-02-11 Thread Rick Leir
Yes, you are right. I was just trying to help, and did not have time to dig out the details. So the question is: how do you tell Solr to pass the language arg to Tika and Tesseract? On February 11, 2017 12:54:02 AM EST, "Игорь Абрашин" wrote: >Hi, Rick. >I didnt mean that he need to train, be

Re: OCR image contains cyrillic characters

2017-02-10 Thread Игорь Абрашин
Hi, Rick. I didn't mean that he needs to train it, because Tesseract works well separately. So the Tika included in Solr doesn't try to use the Russian dictionary to recognize Cyrillic text, and the result comes out using only the English alphabet. On 10 Feb 2017 at 15:28, "Rick Leir" wrote: > My guess is that you
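
For reference, a minimal sketch of running Tesseract on its own with an explicit language, which is the standalone case described above. It assumes the `tesseract` binary is on the PATH and that the `rus` and `eng` traineddata files are installed; the helper names and the file name `scan.tif` are hypothetical. It does not show how Solr or Tika pass the language through, which is the open question in this thread.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, languages=("rus", "eng")):
    """Build a tesseract CLI call that prints recognized text to stdout.

    "-l" selects the traineddata to use; several languages are joined with "+".
    """
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def ocr_image(image_path, languages=("rus", "eng")):
    """Run tesseract and return the recognized text (assumes UTF-8 output)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; only the command is printed here.
    print(build_tesseract_command("scan.tif"))
```

If the standalone call returns Cyrillic text but the Solr result does not, that would point at the language setting not reaching Tesseract rather than at the traineddata.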

Re: OCR image contains cyrillic characters

2017-02-10 Thread Rick Leir
My guess is that you are using Tika and Tesseract. The latter is complex, and you can start learning at https://wiki.apache.org/tika/TikaOCR <--shows you how to work with TIFF The traineddata for Cyrillic is here: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files https://gith

OCR image contains cyrillic characters

2017-02-10 Thread Игорь Абрашин
Hello, community! Has anyone managed to run OCR on JPG, TIFF or other images with Cyrillic text inside? So far I get only Latin letters in the result (it looks like ugly transliterated text). For images that contain only Latin letters it works fine. Does anyone have any suggestions, best practices or case studies re

Re: Cyrillic characters

2006-07-19 Thread Yonik Seeley
On 7/19/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: Now the problem: Tomcat 5.5.17 isn't decoding percent-encoded UTF-8, but instead treating %C3%A9 as two separate characters. Here's the magic for Tomcat: http://split-s.blogspot.com/2005/12/internationalized-get-parameters-with.html edit serv
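
The symptom described here can be reproduced outside of Tomcat. The following is a small sketch using only Python's standard library (not the Tomcat code path itself): the same percent-encoded bytes decode to one character under UTF-8 and to two characters under ISO-8859-1.

```python
from urllib.parse import unquote

raw = "%C3%A9"  # the percent-encoded bytes 0xC3 0xA9

# Interpreted as UTF-8, the two bytes form a single character.
as_utf8 = unquote(raw, encoding="utf-8")

# Interpreted as ISO-8859-1, each byte becomes its own character.
as_latin1 = unquote(raw, encoding="iso-8859-1")

print(repr(as_utf8), len(as_utf8))      # 'é' 1
print(repr(as_latin1), len(as_latin1))  # 'Ã©' 2
```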

Re: Cyrillic characters

2006-07-19 Thread Yonik Seeley
On 7/19/06, WHIRLYCOTT <[EMAIL PROTECTED]> wrote: Solr-trunk currently uses ISO-8859-1 as the character encoding for the admin pages. One of the patches I submitted changes the admin pages to use UTF-8 and that fixes the problem. OK, we are closer to working correctly. It appears that the web

Re: Cyrillic characters

2006-07-19 Thread WHIRLYCOTT
On Jul 19, 2006, at 11:44 AM, Bertrand Delacretaz wrote: -If I search "désormais" from the solr/admin page, it is translated to q=d%E9sormais in the URL, and nothing's found (the word is in my index) http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset "The default value fo

Re: Re: Cyrillic characters

2006-07-19 Thread Bertrand Delacretaz
On 7/19/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: ...Can anyone else shed some light on this?.. I have to run now but I *think* there are encoding settings in web.xml, and IIRC they might be different for Tomcat or Jetty. Setting UTF-8 everywhere should help. -Bertrand

Re: Cyrillic characters

2006-07-19 Thread WHIRLYCOTT
I submitted two patches that fix one problem with URL encoding and another with the screens on the webapp. http://issues.apache.org/jira/browse/SOLR-35 phil. On Jul 19, 2006, at 11:58 AM, Yonik Seeley wrote: On 7/19/06, Tricia Williams <[EMAIL PROTECTED]> wrote: You mentioned i

Re: Cyrillic characters

2006-07-19 Thread Yonik Seeley
On 7/19/06, Tricia Williams <[EMAIL PROTECTED]> wrote: You mentioned in another earlier post that q=h%c3%e9 would find matching hits. My experience shows that while the UTF-8 encoded query doesn't generate any exceptions, no results are matched. However q=h%e9llo would find matching results

Re: Re: Cyrillic characters

2006-07-19 Thread Bertrand Delacretaz
On 7/19/06, Tricia Williams <[EMAIL PROTECTED]> wrote: ...What I called the _solr url encoding_ was the q= parameter translated into encoding in the url... I think I've seen the same problem, haven't investigated deeper but IIUC the encoding used when posting a form is related to both the enc

Re: Cyrillic characters

2006-07-19 Thread Tricia Williams
q=h%e9llo would find matching results (the result set I'd match in Luke). So assuming that I can fix the form encoding errors so that the characters are encoded as UTF-8, I believe that I would continue to return incorrect results. Will cyrillic characters be treated any differently than th

Re: Cyrillic characters

2006-07-18 Thread WHIRLYCOTT
On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote: that using the packaged example admin interface entering a query with a string of cyrillic characters causes a java.lang.ArrayIndexOutOfBoundsException ... I have this much fixed as well. However, I'm still walking data through the

Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley
On 7/18/06, Tricia Williams <[EMAIL PROTECTED]> wrote: My sample query is: .. (the english word _canada_ translated into russian) or %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (utf-8) or %26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B (solr url encoding) Hi Tricia,
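
To see what the two encoded forms quoted above actually contain, here is a short sketch using Python's standard library. The first string is percent-encoded UTF-8; the second is percent-encoded HTML numeric character references, so it needs two decoding steps. The variable names are illustrative only.

```python
from html import unescape
from urllib.parse import unquote

utf8_form = "%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0"
entity_form = (
    "%26%231050%3B%26%231072%3B%26%231085%3B"
    "%26%231072%3B%26%231076%3B%26%231072%3B"
)

# Step 1 for both: undo the percent-encoding.
decoded_utf8 = unquote(utf8_form, encoding="utf-8")
entities = unquote(entity_form)  # yields "&#1050;&#1072;..." as plain ASCII

# Step 2 for the second form only: resolve the numeric character references.
decoded_entities = unescape(entities)

print(decoded_utf8)      # Канада
print(entities)          # &#1050;&#1072;&#1085;&#1072;&#1076;&#1072;
print(decoded_entities)  # Канада
```

Both forms carry the same six Cyrillic letters; the second one only reaches them if the receiver also interprets HTML character references, which a query parser would not normally do.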

Re: Cyrillic characters

2006-07-18 Thread WHIRLYCOTT
I've started poking around and have fixed already one bug related to URL encoding of data. I'm going to work some more on this tonight and will hopefully have a patch for you soon. phil. On Jul 18, 2006, at 6:19 PM, Yonik Seeley wrote: On 7/18/06, WHIRLYCOTT <[EMAIL PROTECTED]> wrote: Ho

Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley
Definitely some Firefox bugs with UTF-8 at least: If I go to the admin screen, and paste héllo into the query box, then kill Solr and run netcat to see exactly what I get, it's the following: $ nc -l -p 8983 GET /solr/select/?stylesheet=&q=h%E9llo&version=2.1&start=0&rows=10&indent=on HTTP/1.1
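
The `h%E9llo` in the captured request is what percent-encoding produces when the text is first encoded as ISO-8859-1 rather than UTF-8. A small sketch with Python's standard library shows the two possible outputs for the same input; it illustrates the encodings only and says nothing about why the browser picked one of them.

```python
from urllib.parse import quote

word = "h\u00e9llo"  # "héllo"

# Encoded as ISO-8859-1 first: é is the single byte 0xE9.
latin1_query = quote(word, encoding="iso-8859-1")

# Encoded as UTF-8 first: é is the two bytes 0xC3 0xA9.
utf8_query = quote(word, encoding="utf-8")

print(latin1_query)  # h%E9llo
print(utf8_query)    # h%C3%A9llo
```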

Re: Cyrillic characters

2006-07-18 Thread Chris Hostetter
: ps. I am using mozilla firefox as my main browser which leads to the : behaviour I reported above. IE 6.0 works fine for cyrillics although : there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for : the same query as before). The problem may not be in the Solr internals as mu
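
The "strange but different encoding" is consistent with the query being sent as Windows-1251 bytes. That is an assumption on my part, not something stated in the thread, but it can be checked with a short standard-library sketch: decoding the sequence as cp1251 and re-encoding it as UTF-8 gives the UTF-8 form quoted elsewhere in this thread.

```python
from urllib.parse import quote, unquote

ie_form = "%CA%E0%ED%E0%E4%E0"

# Assumption: the bytes are Windows-1251 (cp1251), a common Cyrillic code page.
decoded = unquote(ie_form, encoding="cp1251")

# Re-encode the same text as percent-encoded UTF-8 for comparison.
utf8_form = quote(decoded, encoding="utf-8")

print(decoded)    # Канада
print(utf8_form)  # %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0
```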

Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley
OK, let's split up the indexing side from the query side for a moment and assume that you are indexing correctly (setting the content-type correctly, etc). I just added a new value to the multi-valued features field to the solr.xml example document: "Good unicode support: héllo (hello with an acc

Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley
On 7/18/06, WHIRLYCOTT <[EMAIL PROTECTED]> wrote: How much testing have people done using UTF-8 data on Solr? UTF-8 query *output* is well tested with Resin within CNET. Indexing UTF-8 is also well tested (again, mostly with Resin). UTF-8 query input is not really tested at all AFAIK (the q par

Re: Cyrillic characters

2006-07-18 Thread WHIRLYCOTT
lrish. Our old web app was capable of searching for queries with cyrillic characters in them. I'm finding that using the packaged example admin interface entering a query with a string of cyrillic characters causes a java.lang.ArrayIndexOutOfBoundsException. I've also noted that

Cyrillic characters

2006-07-18 Thread Tricia Williams
Hi all, I'm trying to adapt our old cocoon/lucene based web search application to one that is more solrish. Our old web app was capable of searching for queries with cyrillic characters in them. I'm finding that using the packaged example admin interface entering a query with a