Hi all,

I have a problem after updating to solr 1.2. I'm using the bundled jetty
that comes with the latest solr release.
Some of the contents stored in my index contain characters from the
Unicode private use area above 0x100000. (They are used by some
proprietary software, and the text extraction does not throw them out.)
In contrast to solr 1.1, the current release returns these characters
encoded as a sequence of two surrogate characters. This could result
from a UTF-16 conversion taking place somewhere in the system. In fact,
a look into the index with Luke suggests that Lucene is storing its
data in UTF-16 encoding: the code point 0x100058 is stored as the two
surrogate characters 0xDBC0 and 0xDC58. This behaviour is the same in
solr 1.1 and 1.2. But while solr 1.1 combines the two surrogates into
one 4-byte UTF-8 character in the result, solr 1.2 returns the UTF-8
encodings of the two individual surrogate characters that I see using
Luke. Unfortunately this results in invalid UTF-8 encoded text that
(for example) cannot be displayed by Internet Explorer.
A request like http://localhost:8983/solr/select?q=*:* results in an
error message from the browser.
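For illustration, the surrogate arithmetic described above can be sketched in plain Python (this is just the Unicode math, independent of Solr or Lucene; the byte values shown are what the UTF-8 and surrogate encodings produce by definition):

```python
cp = 0x100058

# UTF-16 surrogate pair computation for a code point above 0xFFFF:
v = cp - 0x10000
high = 0xD800 + (v >> 10)    # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)   # low (trail) surrogate
print(hex(high), hex(low))   # 0xdbc0 0xdc58 -- matches what Luke shows

# Correct UTF-8: one 4-byte sequence for the whole code point.
correct = chr(cp).encode("utf-8")
print(correct)               # b'\xf4\x80\x81\x98'

# What the broken response looks like: each surrogate encoded as if it
# were an ordinary 3-byte BMP character (CESU-8 style), which is not
# valid UTF-8. ("surrogatepass" is needed to force Python to emit it.)
broken = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")
print(broken)                # b'\xed\xaf\x80\xed\xb1\x98'

# A strict UTF-8 decoder (like the browser's) rejects that sequence:
try:
    broken.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid utf-8:", e.reason)
```

So a response containing the 6-byte surrogate encoding instead of the single 4-byte sequence is exactly what a strict decoder refuses to display.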

This is easy to reproduce for anyone who wants to debug it. I have
attached a valid UTF-8 encoded XML document that contains the 4-byte
encoded code point 0x100058. It can be indexed with post.jar. Sending
this request via Internet Explorer then results in an error:
http://localhost:8983/solr/select?q=*:*

 <<utf.xml>> 
I tried the new solr 1.2 war file with the old example distribution
(solr 1.1 and jetty 5.1). Surprisingly, this does not reveal the
problem, so the whole story might even be a jetty issue.

Any ideas?

-- Christian
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="id">UTF8TEST</field>
<field name="name">abcdefg􀁘hijklmnop</field>
</doc>
</add>
