Crap, you're right. I have a well-tested application that uses UTF-8 everywhere possible, and I just tested it with some Russian text. Solr's coughing up this exception:

Jul 18, 2006 6:00:05 PM org.apache.solr.core.SolrException log
SEVERE: java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.solr.search.QueryParsing.parseSort(QueryParsing.java:141)
        at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:96)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:592)
        at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:94)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
        at org.mortbay.http.HttpServer.service(HttpServer.java:909)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

You're going directly against Solr/Jetty, right? Not proxied or mod_rewrite'd through Apache?

Solr isn't properly decoding the data being received by the servlet. I think that I can fix this using some of the tricks that I've learned in building my site. More later.
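
To illustrate the kind of failure I suspect, here is a rough Python sketch. It assumes, without confirmation from this thread, that the container falls back to ISO-8859-1 when decoding the query string. The percent-encoded value is the UTF-8 form from the report quoted below; the variable names are made up for illustration.

```python
from urllib.parse import unquote_to_bytes

# UTF-8 percent-encoded query value, copied from the report quoted below.
raw = unquote_to_bytes("%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0")

# What the client meant: six Cyrillic characters.
as_utf8 = raw.decode("utf-8")

# ASSUMPTION: a server that decodes the same bytes as ISO-8859-1
# sees one character per byte, i.e. twelve unrelated characters.
as_latin1 = raw.decode("iso-8859-1")

print(len(raw), len(as_utf8), len(as_latin1))  # 12 6 12

# The damage is reversible as long as the bytes survive untouched.
recovered = as_latin1.encode("iso-8859-1").decode("utf-8")
print(recovered == as_utf8)  # True
```

If that assumption holds, the fix belongs wherever the request bytes are first turned into a string, not in the query parser.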

How much testing have people done using UTF-8 data on Solr?

phil.



On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote:

Hi all,

I'm trying to adapt our old Cocoon/Lucene based web search application to one that is more solrish. Our old web app was capable of searching for queries with Cyrillic characters in them. I'm finding that, using the packaged example admin interface, entering a query with a string of Cyrillic characters causes a java.lang.ArrayIndexOutOfBoundsException. I've also noted that the URL built from the search form is not UTF-8 encoded. So if I try to manipulate the query string by inserting a UTF-8 encoded string in the q= parameter, the values are interpreted incorrectly, and I cannot use this approach as a workaround. My sample query is:

  ...... (the English word _canada_ translated into Russian)
  or %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (utf-8)
  or %26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B (solr url encoding)

I would appreciate any advice or suggestions that would allow me to search for Cyrillic text in Solr. If anyone knows why Solr behaves as it does with the strange encoding, a brief explanation of what causes this behaviour, and of what the encoding is (Unicode?), would be helpful. If anyone else has forced Solr to accept UTF-8 encoded q= parameters with success, I would love to know how you did it.
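
On the question of what the strange encoding is: the third form above looks like HTML numeric character references (&#NNNN;, where NNNN is the decimal Unicode code point) that have then been percent-encoded. Browsers commonly fall back to this when a form is submitted in a charset that cannot represent the typed characters. The Python sketch below only checks that reading against the strings in this message; it is not a statement about what Solr or the admin page does internally, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote

# UTF-8 form of the sample query, copied from the message above.
utf8_form = "%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0"
word = unquote(utf8_form)  # unquote() decodes as UTF-8 by default

# Write each character as a decimal numeric character reference.
ncr = "".join(f"&#{ord(ch)};" for ch in word)
print(ncr)  # &#1050;&#1072;&#1085;&#1072;&#1076;&#1072;

# Percent-encode the references: '&' -> %26, '#' -> %23, ';' -> %3B.
ncr_form = quote(ncr, safe="")
print(ncr_form)
```

The printed ncr_form matches the third sample string, which suggests the form page is being served or submitted in a charset that has no Cyrillic.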

Thanks in advance!
Tricia

ps. I am using Mozilla Firefox as my main browser, which leads to the behaviour I reported above. IE 6.0 works fine for Cyrillic, although there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for the same query as before).
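
The IE 6.0 string is consistent with the same word encoded in windows-1251, one byte per Cyrillic letter, and then percent-encoded. A short Python check of that guess follows; treating the IE output as windows-1251 is an inference from the bytes, not something IE reports.

```python
from urllib.parse import quote, unquote

# Recover the word from the UTF-8 form given earlier in this message.
word = unquote("%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0")

# ASSUMPTION: IE submitted the form in windows-1251 (Python codec "cp1251").
ie_form = quote(word, safe="", encoding="cp1251")
print(ie_form)  # %CA%E0%ED%E0%E4%E0

# Decoding works only if the receiver assumes the same charset.
print(unquote(ie_form, encoding="cp1251") == word)  # True
```

This would also explain why the same query yields a different URL in each browser: each one picks its own charset for the form submission.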


--
                                   Whirlycott
                                   Philip Jacob
                                   [EMAIL PROTECTED]
                                   http://www.whirlycott.com/phil/

