Re: Indexing UTF-8

Tricia Williams Thu, 10 Aug 2006 09:07:12 -0700

I no longer remember when or where this came up, but when using Tomcat there is a known character encoding problem when you expect UTF-8. In Tomcat's $TOMCAT_HOME/conf/server.xml, on the Connector for the port you're running Solr on, ensure URIEncoding="UTF-8" is set, as in:

<Connector port="8080" URIEncoding="UTF-8" maxHttpHeaderSize="8192"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               connectionTimeout="20000" disableUploadTimeout="true"/>


This has solved some of my encoding problems.
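
As a rough illustration of what that setting changes: URIEncoding controls how the percent-encoded bytes in the request URI are decoded, and without it Tomcat 5.5 falls back to ISO-8859-1. The Python sketch below is only an imitation of that decoding step, not Tomcat code; the author name is taken from the quoted message, and using it as a query value is an assumption made for the example.

```python
from urllib.parse import quote, unquote

# Author name from the quoted message, written with escapes so the
# source stays ASCII-only (U+0131 is the dotless lowercase i).
name = "Ayy\u0131ld\u0131z, Turhan"

# A client sending UTF-8 percent-encodes each UTF-8 byte of the value.
encoded = quote(name, encoding="utf-8")

# Decode the same bytes the way a container might without and with
# URIEncoding="UTF-8" (ISO-8859-1 is assumed as the fallback).
as_latin1 = unquote(encoded, encoding="iso-8859-1")
as_utf8 = unquote(encoded, encoding="utf-8")

print(encoded)
print(as_latin1)
print(as_utf8)
```

Only the UTF-8 decoding returns the original name; the ISO-8859-1 decoding turns each dotless i into two characters. Note that this covers the URI and query string only, so data sent in a POST body may need to be checked separately.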

Hope this helps,
Tricia

On Thu, 10 Aug 2006, Andrew May wrote:

Hi,

I'm trying to index some UTF-8 data, but I'm experiencing some problems.
I'm using the 28th July nightly build, which I believe contains all the recent fixes for making the administration webapp use UTF-8. I've tried running in both the provided Jetty instance and Tomcat 5.5.17.
I've indexed using both the post.sh script (i.e. curl) and HttpClient, both with the same results.
I'm specifically concentrating on one author name that has been causing problems:
Ayyıldız, Turhan
(I'm encoding this email as UTF-8 in the hope that comes through OK)

What I'm seeing coming back from Solr is:
AyyÄ±ldÄ±z, Turhan
The undotted lowercase i Turkish character (U+0131) is instead appearing as a Latin capital A with diaeresis (U+00C4) and a plus-minus character (U+00B1).
Using Luke to look at the index directly the field appears as:
AyyÄ&#177;ldÄ&#177;z, Turhan
Which, assuming Luke is displaying this correctly (&#177; is ±), means something happened in the posting of the data or the indexing.
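
The substitution described above is what you would expect if the two UTF-8 bytes of U+0131 were each read as a separate ISO-8859-1 character somewhere along the way. The Python sketch below shows that at the byte level; it says nothing about where in the posting or indexing path the misreading happens, and the variable names are made up for the example.

```python
original = "\u0131"  # dotless lowercase i, U+0131

# UTF-8 encodes U+0131 as two bytes: 0xC4 0xB1.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as ISO-8859-1 yields one character per byte.
misread = utf8_bytes.decode("iso-8859-1")
print([f"U+{ord(c):04X}" for c in misread])  # ['U+00C4', 'U+00B1']

# Reversing the misreading recovers the original character, which is a
# quick way to test whether stored text was damaged in exactly this way.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```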
I'm completely out of my depth when it comes to character encodings, so I don't know whether I'm doing something stupid, mis-configuring something, or whether this is a genuine problem not of my own making.
Any thoughts?

Thanks,

Andrew
