Cyrillic characters

2006-07-18 Thread Tricia Williams

Hi all,

   I'm trying to adapt our old cocoon/lucene based web search application 
to one that is more solrish.  Our old web app was capable of searching for 
queries with cyrillic characters in them.  I'm finding that using the 
packaged example admin interface entering a query with a string of 
cyrillic characters causes a java.lang.ArrayIndexOutOfBoundsException. 
I've also noted that the url built from the search form is not utf-8 
encoded.  So obviously if I try to manipulate the query string by 
inserting a utf-8 encoded string in the q= parameter the values are 
interpreted incorrectly and as such I cannot use this approach as a 
work-around.  My sample query is: Канада (the English word _canada_ 
translated into Russian) or 
%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (utf-8) or 
%26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B 
(solr url encoding)


   I would appreciate any advice or suggestions that would allow me 
to search for cyrillics in solr.  If anyone knows why solr is behaving as 
it does with the strange encoding, a brief explanation of what causes this 
behaviour could be helpful and what the encoding is (unicode?).  If anyone 
else has forced solr to accept utf-8 encoded q= parameters with success I 
would love to know how you did it.


Thanks in advance!
Tricia

ps.  I am using mozilla firefox as my main browser which leads to the 
behaviour I reported above.  IE 6.0 works fine for cyrillics although 
there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for 
the same query as before).
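
For reference, all three percent-encoded strings quoted in this message can be reproduced from the same six Cyrillic letters. The Python sketch below is only an illustration and is not part of the original application. It assumes that the "solr url encoding" form is the percent-encoding of decimal HTML character references, and that the IE form is windows-1251; both assumptions reproduce the quoted strings exactly, but neither is confirmed in the thread.

```python
from urllib.parse import quote

word = "Канада"  # the sample query from the message above

# 1) UTF-8 bytes, percent-encoded.
utf8_form = quote(word.encode("utf-8"), safe="")

# 2) Each character written as a decimal HTML character reference
#    (&#1050; and so on), and that ASCII text then percent-encoded.
#    Assumption: this is what the browser did with the admin form.
references = "".join(f"&#{ord(ch)};" for ch in word)
reference_form = quote(references, safe="")

# 3) windows-1251 (a single-byte Cyrillic code page), percent-encoded.
#    Assumption: this is the charset IE 6.0 picked for the same form.
cp1251_form = quote(word.encode("cp1251"), safe="")

print(utf8_form)       # %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0
print(reference_form)  # %26%231050%3B%26%231072%3B...
print(cp1251_form)     # %CA%E0%ED%E0%E4%E0
```

Note that once the second form has been percent-decoded it is plain ASCII text full of ampersands, digits and semicolons, so whatever parses the query never sees Cyrillic letters at all.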


Re: Cyrillic characters

2006-07-18 Thread WHIRLYCOTT
Crap, you're right.  I have a well-tested application that's using  
UTF-8 everywhere possible and I just tested with some Russian text.   
Solr's coughing up this as an exception:


Jul 18, 2006 6:00:05 PM org.apache.solr.core.SolrException log
SEVERE: java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.solr.search.QueryParsing.parseSort(QueryParsing.java:141)
    at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:96)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:592)
    at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:94)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
    at org.mortbay.http.HttpServer.service(HttpServer.java:909)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)


You're going directly against Solr/Jetty, right?  Not proxied or  
mod_rewrite'd through to Apache?


Solr isn't properly encoding the data being received by the servlet.   
I think that I can fix this using some of the tricks that I've  
learned in building my site.  More later.


How much testing have people done using UTF-8 data on Solr?

phil.





--
   Whirlycott
   Philip Jacob
   [EMAIL PROTECTED]
   http://www.whirlycott.com/phil/




Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley

On 7/18/06, WHIRLYCOTT <[EMAIL PROTECTED]> wrote:

How much testing have people done using UTF-8 data on Solr?


UTF-8 query *output* is well tested with Resin within CNET.
Indexing UTF-8 is also well tested (again, mostly with Resin).
UTF-8 query input is not really tested at all AFAIK (the q param to
the standard request handler).

-Yonik


Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley

OK, let's split up the indexing side from the query side for a moment
and assume that you are indexing correctly (setting the content-type
correctly, etc).

I just added a new value to the multi-valued features field to the
solr.xml example document:
 "Good unicode support: héllo (hello with an accent over the e)"
or in the XML:
 <field name="features">Good unicode support: h&#xE9;llo (hello with
an accent over the e)</field>

I used a numeric entity because post.sh doesn't specify any
content-type (ascii or latin1 may be assumed).  But as I said, let's
assume things are indexed correctly for now.

The URI standard says the following:
'''When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be
percent-encoded. For example, the character A would be represented as
"A", the character LATIN CAPITAL LETTER A WITH GRAVE would be
represented as "%C3%80", and the character KATAKANA LETTER A would be
represented as "%E3%82%A2".'''

http://www.gbiv.com/protocols/uri/rfc/rfc3986.html

So, the unicode code point for the e with an acute accent is \u00E9.
In UTF-8 encoding it's a two-byte sequence: 0xC3, 0xA9

In both Firefox and IE, the following URI works fine to find the document:
http://localhost:8983/solr/select/?stylesheet=&q=h%C3%A9llo

If I try pasting héllo from notepad directly into the URL, IE works
fine, but Firefox substitutes the accented e with %E9, which is
incorrect.

I haven't tried more complicated examples yet, and I haven't tried
wget, etc, but things look like they are working as expected so far
(with the exception of a firefox bug).

-Yonik
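
The byte-level claims in the message above are easy to check outside a browser. The following Python sketch is an illustration only; it uses the standard library's `urllib.parse` and reproduces the RFC 3986 examples, the working `h%C3%A9llo` form, and the `h%E9llo` form that Firefox produced.

```python
from urllib.parse import quote, unquote

text = "h\u00e9llo"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# U+00E9 is the two-byte sequence 0xC3 0xA9 in UTF-8.
assert text.encode("utf-8") == b"h\xc3\xa9llo"

# Percent-encoding the UTF-8 bytes gives the URI that works.
utf8_query = quote(text, safe="", encoding="utf-8")
print(utf8_query)    # h%C3%A9llo

# Percent-encoding the ISO-8859-1 byte gives the form Firefox sent.
latin1_query = quote(text, safe="", encoding="latin-1")
print(latin1_query)  # h%E9llo

# The three examples from the RFC text: "A", A WITH GRAVE, KATAKANA A.
rfc_examples = [quote(ch, safe="") for ch in ("A", "\u00c0", "\u30a2")]
print(rfc_examples)  # ['A', '%C3%80', '%E3%82%A2']

# A receiver that assumes UTF-8 cannot recover the character from %E9:
# 0xE9 followed by "l" is not a valid UTF-8 sequence, so it is replaced
# with U+FFFD.
misread = unquote(latin1_query, encoding="utf-8", errors="replace")
print(misread == "h\ufffdllo")  # True
```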


Re: Cyrillic characters

2006-07-18 Thread Chris Hostetter

: ps.  I am using mozilla firefox as my main browser which leads to the
: behaviour I reported above.  IE 6.0 works fine for cyrillics although
: there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for
: the same query as before).

The problem may not be in the Solr internals as much as in the form on the
admin screen -- i'm not on a computer where i can do any testing, but the
problem may be that the <form> tag in index.jsp/form.jsp doesn't specify
any charset options, so the browser is making an assumption (and the Solr
internals are making a different one)

Another possibility is that this is "yet another jetty issue"

Things I'd try if i had the time/resources:

1) Make a JUnit test that executes the query you are trying -- this should
rule out the possibility of a Lucene/SolrCore bug

2) Try running Solr in Tomcat and see if that has the same problem.

3) Try adding an accept-charset param to the form on the admin screens and
see if that fixes the problem.



-Hoss



Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley

Definitely some Firefox bugs with UTF-8 at least:
If I go to the admin screen, and paste in héllo into the query box,
then kill Solr and run netcat to see exactly what I get, it's the
following:

$ nc -l -p 8983
GET /solr/select/?stylesheet=&q=h%E9llo&version=2.1&start=0&rows=10&indent=on HTTP/1.1
Host: localhost:8983
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://localhost:8983/solr/admin/
Cookie: JSESSIONID=3nqupchdew5mh


URLs should be percent-encoded UTF-8 bytes, or at least UTF-8 bytes.
ISO-latin1 isn't acceptable.

-Yonik
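
A capture like the one above can also be checked mechanically. The Python sketch below is a rough illustration, not anything Solr does: `raw_param` and `decode_param` are hypothetical helper names, and the fall-back to ISO-8859-1 is only a guess, since every byte sequence decodes as Latin-1.

```python
from urllib.parse import urlsplit, unquote_to_bytes

def raw_param(request_line, name):
    """Return the still-percent-encoded value of one query parameter."""
    method, target, version = request_line.split(" ")
    for pair in urlsplit(target).query.split("&"):
        key, _, value = pair.partition("=")
        if key == name:
            return value
    return None

def decode_param(raw):
    """Percent-decode to bytes, then report which charset fits."""
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        return "utf-8", data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; Latin-1 never fails, so this is only a guess.
        return "latin-1", data.decode("latin-1")

line = ("GET /solr/select/?stylesheet=&q=h%E9llo&version=2.1"
        "&start=0&rows=10&indent=on HTTP/1.1")

q = raw_param(line, "q")
print(q)                           # h%E9llo
print(decode_param(q))             # ('latin-1', 'héllo')
print(decode_param("h%C3%A9llo"))  # ('utf-8', 'héllo')
```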


Re: Cyrillic characters

2006-07-18 Thread WHIRLYCOTT
I've started poking around and have already fixed one bug related to  
URL encoding of data.  I'm going to work some more on this tonight  
and will hopefully have a patch for you soon.


phil.





Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley

On 7/18/06, Tricia Williams <[EMAIL PROTECTED]> wrote:

 My sample query is: Канада (the english word _canada_
translated into russian) or
%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (utf-8) or
%26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B
(solr url encoding)


Hi Tricia,
Could you clarify what you mean by "solr url encoding"?  Where do you see this?
The servlet container decodes URLs, and I'm not sure where in Solr
URLs would be encoded.

-Yonik
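
Decoding the third quoted string one layer at a time shows what it contains. This Python sketch is illustrative only; it suggests, without proving, that the string was built by whatever submitted the form rather than by Solr, because one round of percent-decoding leaves HTML character references, not Cyrillic text.

```python
import html
from urllib.parse import unquote

sent = ("%26%231050%3B%26%231072%3B%26%231085%3B"
        "%26%231072%3B%26%231076%3B%26%231072%3B")

# Layer 1: the percent-decoding a servlet container performs.
after_container = unquote(sent)
print(after_container)  # &#1050;&#1072;&#1085;&#1072;&#1076;&#1072;

# Layer 2: only if something also resolves the character references
# does the original word come back.
as_text = html.unescape(after_container)
print(as_text)          # Канада
```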


Re: Cyrillic characters

2006-07-18 Thread WHIRLYCOTT

On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote:

that using the packaged example admin interface entering a query  
with a string of cyrillic characters causes a  
java.lang.ArrayIndexOutOfBoundsException


... I have this much fixed as well.

However, I'm still walking data through the stack and I'm not yet  
convinced that my data is being stored properly as UTF-8 strings.  It  
could be a character encoding issue in the client that I'm using to  
hit the /solr/update servlet or it could be something more insidious.


But I need this stuff working for my own site (www.stylefeeder.com,  
in case you care...), so I will continue with this and report back.


phil.
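
One way to take the client out of the equation when posting to /solr/update is to make the encoding explicit. The Python sketch below is a minimal illustration under stated assumptions: the host and port are taken from the example URLs earlier in the thread, the document id `utf8-test` and the helper `add_command` are made up for the example, and whether a given Solr and container combination honours the declared charset still has to be verified. The request is only constructed here, not sent.

```python
from urllib.request import Request
from xml.sax.saxutils import escape

UPDATE_URL = "http://localhost:8983/solr/update"  # assumed local example instance

def add_command(doc_id, feature):
    """Build a Solr <add> command for one document (hypothetical helper)."""
    return ('<add><doc><field name="id">{}</field>'
            '<field name="features">{}</field></doc></add>'
            ).format(escape(doc_id), escape(feature))

xml = add_command("utf8-test", "Канада héllo")

# Option 1: send UTF-8 bytes and declare the charset in Content-Type.
utf8_request = Request(
    UPDATE_URL,
    data=xml.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8"},
)

# Option 2: keep the body pure ASCII by writing every non-ASCII
# character as a numeric character reference, so no charset guess
# on the receiving side can corrupt it.
ascii_body = xml.encode("ascii", errors="xmlcharrefreplace")
print(ascii_body.decode("ascii"))
```

Passing `utf8_request` to `urllib.request.urlopen` would actually send it. Comparing what is stored after each option helps narrow down whether the bytes are being damaged in the client or further down the stack.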

