Cyrillic characters
Hi all,

I'm trying to adapt our old Cocoon/Lucene-based web search application to one that is more solrish. Our old web app was capable of searching for queries with Cyrillic characters in them. I'm finding that, using the packaged example admin interface, entering a query with a string of Cyrillic characters causes a java.lang.ArrayIndexOutOfBoundsException.

I've also noted that the URL built from the search form is not UTF-8 encoded. So if I try to manipulate the query string by inserting a UTF-8 encoded string in the q= parameter, the values are interpreted incorrectly, and I cannot use this approach as a workaround.

My sample query is:

Канада (the English word _canada_ translated into Russian)
or %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (UTF-8)
or %26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B (solr url encoding)

I would appreciate any advice or suggestions that would allow me to search for Cyrillic text in Solr. If anyone knows why Solr is behaving as it does with the strange encoding, a brief explanation of what causes this behaviour, and of what the encoding is (Unicode?), would be helpful. If anyone else has forced Solr to accept UTF-8 encoded q= parameters with success, I would love to know how you did it.

Thanks in advance!
Tricia

PS. I am using Mozilla Firefox as my main browser, which leads to the behaviour I reported above. IE 6.0 works fine for Cyrillic, although there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for the same query as before).
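For anyone trying to tell the three percent-encoded strings apart, here is a minimal Python sketch that rebuilds each of them from the Russian word for Canada (Канада). It rests on two assumptions that the message itself does not state: that the string labelled "solr url encoding" is the percent-encoded form of HTML numeric character references (&#1050; and so on), and that the IE string is Windows-1251. The variable names are made up for the illustration.

```python
from urllib.parse import quote

# The Russian word for Canada, written with escapes so the source file
# itself stays pure ASCII.
word = "\u041a\u0430\u043d\u0430\u0434\u0430"

# 1) UTF-8 octets, percent-encoded (quote() uses UTF-8 for str input).
utf8_form = quote(word, safe="")

# 2) Assumption: the browser replaced each character with an HTML numeric
#    character reference (&#NNNN;) and then percent-encoded "&", "#", ";".
ncr_text = "".join(f"&#{ord(ch)};" for ch in word)
ncr_form = quote(ncr_text, safe="")

# 3) Assumption: IE 6 encoded the word as Windows-1251 (a Cyrillic
#    single-byte code page) before percent-encoding.
cp1251_form = quote(word.encode("cp1251"), safe="")

print(utf8_form)
print(ncr_form)
print(cp1251_form)
```

If those assumptions hold, the three printed strings are the three forms quoted in the message, which would mean none of them is a Solr-specific encoding: each is produced by the browser before the request reaches the server.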
Re: Cyrillic characters
Crap, you're right. I have a well-tested application that's using UTF-8 everywhere possible and I just tested with some Russian text. Solr's coughing up this as an exception:

Jul 18, 2006 6:00:05 PM org.apache.solr.core.SolrException log
SEVERE: java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.solr.search.QueryParsing.parseSort(QueryParsing.java:141)
        at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:96)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:592)
        at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:94)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
        at org.mortbay.http.HttpServer.service(HttpServer.java:909)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

You're going directly against Solr/Jetty, right? Not proxied or mod_rewrite'd through to Apache?

Solr isn't properly encoding the data being received by the servlet. I think that I can fix this using some of the tricks that I've learned in building my site. More later.

How much testing have people done using UTF-8 data on Solr?

phil.
On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote:
> [...] I'm finding that using the packaged example admin interface entering a query with a string of cyrillic characters causes a java.lang.ArrayIndexOutOfBoundsException. [...]

-- Whirlycott Philip Jacob [EMAIL PROTECTED] http://www.whirlycott.com/phil/
Re: Cyrillic characters
On 7/18/06, WHIRLYCOTT <[EMAIL PROTECTED]> wrote:
> How much testing have people done using UTF-8 data on Solr?

UTF-8 query *output* is well tested with Resin within CNET. Indexing UTF-8 is also well tested (again, mostly with Resin). UTF-8 query input is not really tested at all AFAIK (the q param to the standard request handler).

-Yonik
Re: Cyrillic characters
OK, let's split up the indexing side from the query side for a moment and assume that you are indexing correctly (setting the content-type correctly, etc).

I just added a new value to the multi-valued features field in the solr.xml example document:
"Good unicode support: héllo (hello with an accent over the e)"
or in the XML:
Good unicode support: h&#xE9;llo (hello with an accent over the e)

I used a numeric entity because post.sh doesn't specify any content-type (ASCII or Latin-1 may be assumed). But as I said, let's assume things are indexed correctly for now.

The URI standard says the following:

'''When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".'''
http://www.gbiv.com/protocols/uri/rfc/rfc3986.html

So, the Unicode code point for the e with an acute accent is \u00E9. In UTF-8 encoding it's a two-byte sequence: 0xC3, 0xA9.

In both Firefox and IE, the following URI works fine to find the document:
http://localhost:8983/solr/select/?stylesheet=&q=h%C3%A9llo

If I try pasting héllo from Notepad directly into the URL, IE works fine, but Firefox substitutes the accented e with %E9, which is incorrect.

I haven't tried more complicated examples yet, and I haven't tried wget, etc, but things look like they are working as expected so far (with the exception of a Firefox bug).

-Yonik
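The encoding steps described above can be checked with a few lines of Python. This is only a sketch using the standard library's urllib.parse; the helper name rfc3986_encode is made up for the illustration. It reproduces the three RFC examples, the working h%C3%A9llo form, and the Latin-1 %E9 form that Firefox sent.

```python
from urllib.parse import quote, unquote


def rfc3986_encode(text: str) -> str:
    """Encode text as UTF-8 octets, then percent-encode every octet
    that is not an unreserved character (hypothetical helper name)."""
    return quote(text, safe="", encoding="utf-8")


# The three examples given in the RFC text.
plain_a = rfc3986_encode("A")
a_grave = rfc3986_encode("\u00c0")   # LATIN CAPITAL LETTER A WITH GRAVE
katakana_a = rfc3986_encode("\u30a2")  # KATAKANA LETTER A

# The query term from the example document: h + U+00E9 + llo.
hello_utf8 = rfc3986_encode("h\u00e9llo")

# What a client produces if it percent-encodes Latin-1 octets instead.
hello_latin1 = quote("h\u00e9llo", safe="", encoding="latin-1")

# Decoding the Latin-1 form as UTF-8 cannot recover the accented e;
# unquote() substitutes U+FFFD because its default is errors="replace".
decoded_wrong = unquote(hello_latin1)

print(plain_a, a_grave, katakana_a)
print(hello_utf8, hello_latin1, repr(decoded_wrong))
```

The last value shows why the Latin-1 form cannot match the indexed term on a server that decodes query strings as UTF-8: the lone 0xE9 octet is not a valid UTF-8 sequence.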
Re: Cyrillic characters
: ps. I am using mozilla firefox as my main browser which leads to the
: behaviour I reported above. IE 6.0 works fine for cyrillics although
: there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for
: the same query as before).

The problem may not be in the Solr internals so much as in the form on the admin screen. I'm not on a computer where I can do any testing, but the problem may be that the <form> tag in index.jsp/form.jsp doesn't specify any charset options, so the browser is making one assumption (and the Solr internals are making a different one).

Another possibility is that this is "yet another jetty issue".

Things I'd try if I had the time/resources:

1) Make a JUnit test that executes the query you are trying. This should rule out the possibility of a Lucene/SolrCore bug.
2) Try running Solr in Tomcat and see if that has the same problem.
3) Try adding an accept-charset param to the form on the admin screens and see if that fixes the problem.

-Hoss
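To illustrate what "the browser makes one assumption and the server makes another" looks like in practice, here is a small Python sketch. It assumes, purely for the sake of the example, that the client sends UTF-8 and the server side falls back to ISO-8859-1 (the default request encoding in the servlet specification); whether that is what actually happens in Solr or Jetty here has not been established in this thread. The function name repair is hypothetical.

```python
from urllib.parse import unquote

# The UTF-8 percent-encoded query from the original report.
sent = "%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0"

# What the client meant: six Cyrillic characters.
intended = unquote(sent, encoding="utf-8")

# What a server sees if it decodes the same octets as ISO-8859-1:
# twelve unrelated Latin-1 characters, one per octet.
misread = unquote(sent, encoding="iso-8859-1")


def repair(text: str) -> str:
    """Undo the wrong decoding by re-encoding to the original octets
    and decoding them as UTF-8 (hypothetical helper)."""
    return text.encode("iso-8859-1").decode("utf-8")


print(len(intended), len(misread))
print(repair(misread) == intended)
```

The repair step works only because ISO-8859-1 maps every octet to exactly one character, so no information is lost by the wrong decoding. The cleaner fix is to make both sides agree on UTF-8 in the first place, which is what the accept-charset suggestion aims at.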
Re: Cyrillic characters
Definitely some Firefox bugs with UTF-8, at least. If I go to the admin screen and paste héllo into the query box, then kill Solr and run netcat to see exactly what I get, it's the following:

$ nc -l -p 8983
GET /solr/select/?stylesheet=&q=h%E9llo&version=2.1&start=0&rows=10&indent=on HTTP/1.1
Host: localhost:8983
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://localhost:8983/solr/admin/
Cookie: JSESSIONID=3nqupchdew5mh

URLs should be percent-encoded UTF-8 bytes, or at least UTF-8 bytes. ISO-latin1 isn't acceptable.

-Yonik
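The same check can be run mechanically on a captured request line. The Python sketch below pulls the raw q parameter out of the request line from the netcat capture and tests whether its percent-decoded octets form valid UTF-8. The function names are made up for the example, and the parsing is deliberately simple (it assumes a well-formed request line with single spaces).

```python
from urllib.parse import urlsplit, unquote_to_bytes

REQUEST_LINE = (
    "GET /solr/select/?stylesheet=&q=h%E9llo&version=2.1"
    "&start=0&rows=10&indent=on HTTP/1.1"
)


def raw_query_params(request_line: str) -> dict:
    """Return query parameters with their values still percent-encoded."""
    _method, target, _version = request_line.split(" ")
    params = {}
    for pair in urlsplit(target).query.split("&"):
        name, _, value = pair.partition("=")
        params[name] = value
    return params


def is_valid_utf8(raw_value: str) -> bool:
    """True if the percent-decoded octets are a valid UTF-8 sequence."""
    try:
        unquote_to_bytes(raw_value).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


q = raw_query_params(REQUEST_LINE)["q"]
print(q, is_valid_utf8(q))
print("h%C3%A9llo", is_valid_utf8("h%C3%A9llo"))
```

A check like this only says that the octets are not UTF-8; it cannot say which other encoding the client used, so it is a diagnostic aid and not a fix.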
Re: Cyrillic characters
I've started poking around and have already fixed one bug related to URL encoding of data. I'm going to work some more on this tonight and will hopefully have a patch for you soon.

phil.

On Jul 18, 2006, at 6:19 PM, Yonik Seeley wrote:
> On 7/18/06, WHIRLYCOTT <[EMAIL PROTECTED]> wrote:
>> How much testing have people done using UTF-8 data on Solr?
>
> UTF-8 query *output* is well tested with Resin within CNET. Indexing UTF-8 is also well tested (again, mostly with Resin). UTF-8 query input is not really tested at all AFAIK (the q param to the standard request handler).
>
> -Yonik

-- Whirlycott Philip Jacob [EMAIL PROTECTED] http://www.whirlycott.com/phil/
Re: Cyrillic characters
On 7/18/06, Tricia Williams <[EMAIL PROTECTED]> wrote:
> My sample query is:
> Канада (the english word _canada_ translated into russian)
> or %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (utf-8)
> or %26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B (solr url encoding)

Hi Tricia,

Could you clarify what you mean by "solr url encoding"? Where do you see this? The servlet container decodes URLs, and I'm not sure where in Solr URLs would be encoded.

-Yonik
Re: Cyrillic characters
On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote:
> that using the packaged example admin interface entering a query with a string of cyrillic characters causes a java.lang.ArrayIndexOutOfBoundsException ...

I have this much fixed as well. However, I'm still walking data through the stack and I'm not yet convinced that my data is being stored properly as UTF-8 strings. It could be a character encoding issue in the client that I'm using to hit the /solr/update servlet, or it could be something more insidious. But I need this stuff working for my own site (www.stylefeeder.com, in case you care...), so I will continue with this and report back.

phil.

-- Whirlycott Philip Jacob [EMAIL PROTECTED] http://www.whirlycott.com/phil/