Indexing UTF-8

Andrew May Thu, 10 Aug 2006 08:17:51 -0700

Hi,

I'm trying to index some UTF-8 data, but I'm experiencing some problems.

I'm using the 28th July nightly build, which I believe contains all the recent fixes for making the administration webapp use UTF-8. I've tried running in both the provided Jetty instance and Tomcat 5.5.17.

I've indexed using both the post.sh script (i.e. curl) and HttpClient, with the same results.


I'm specifically concentrating on one author name that has been causing problems:
Ayyıldız, Turhan
(I'm encoding this email as UTF-8 in the hope that it comes through OK)

What I'm seeing coming back from Solr is:
AyyÄ±ldÄ±z, Turhan

The Turkish lowercase dotless i (U+0131) is instead appearing as a Latin capital A with diaeresis (U+00C4) followed by a plus-minus sign (U+00B1).
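For what it's worth, this pattern can be reproduced outside Solr. The sketch below assumes that the UTF-8 bytes are being read back as ISO-8859-1 (Latin-1) at some point. That is only my guess at the mechanism, and it says nothing about which component does it. Windows-1252 would give the same result for these two bytes.

```python
# Sketch only: reproduce the garbling by decoding UTF-8 bytes as Latin-1.
# The choice of ISO-8859-1 is an assumption, not something confirmed in Solr.
original = "Ayy\u0131ld\u0131z, Turhan"      # U+0131 is the dotless i

utf8_bytes = original.encode("utf-8")          # U+0131 becomes bytes C4 B1
garbled = utf8_bytes.decode("iso-8859-1")      # each byte read as one character

print(utf8_bytes)
print(garbled)
print([f"U+{ord(c):04X}" for c in garbled[3:5]])
```

The two bytes of the UTF-8 form of U+0131 are 0xC4 and 0xB1, which Latin-1 maps to U+00C4 and U+00B1, matching what I see coming back.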


Using Luke to look at the index directly, the field appears as:
AyyÄ&#177;ldÄ&#177;z, Turhan

Which, assuming Luke is displaying this correctly (&#177; is ±), means something went wrong in the posting of the data or in the indexing.

I'm completely out of my depth when it comes to character encodings, so I don't know whether I'm doing something stupid, mis-configuring something, or whether this is a genuine problem not of my own making.
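As a rough check of that reading, the value shown by Luke can be run backwards. This sketch assumes Luke is simply showing the plus-minus sign as an HTML numeric character reference, and that the stored text is UTF-8 that was mis-read as ISO-8859-1. Both are assumptions on my part.

```python
import html

# Value as displayed by Luke (copied from above).
luke_display = "Ayy\u00c4&#177;ld\u00c4&#177;z, Turhan"

# Resolve the numeric character reference: &#177; is U+00B1 (plus-minus).
stored = html.unescape(luke_display)

# Undo the assumed mis-decoding: back to bytes as Latin-1, then read as UTF-8.
recovered = stored.encode("iso-8859-1").decode("utf-8")

print(stored)
print(recovered)
```

If the round trip gives back the original name, then the index holds the same garbled characters that Solr returns, so the damage would have happened before or during indexing rather than at display time.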


Any thoughts?

Thanks,

Andrew
