Looks like you have a double encoding problem.

This can happen when you fetch UTF-8 data from MySQL as raw bytes (I
know the Perl driver, for instance, has an issue with that) and then
encode it a second time as UTF-8 when you post it to Solr.
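
To make the mechanism concrete, here is a minimal Python sketch of double encoding. It is only an illustration: the sample word and the variable names are made up, and it assumes the bytes get misread as Latin-1 somewhere along the way. The result closely resembles the "allÃ(c)e" in your output.

```python
# Sketch of double encoding, assuming the UTF-8 bytes are misread as Latin-1.
original = "all\u00e9e"  # "allée", a sample word chosen for illustration

# Step 1: the database returns the correct UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Step 2 (the bug): the bytes are treated as Latin-1 characters instead of
# being decoded as UTF-8, so every byte becomes one character.
misread = utf8_bytes.decode("latin-1")

# Step 3: the misread string is encoded as UTF-8 a second time when posted.
double_encoded = misread.encode("utf-8")

# What the receiving side sees after decoding the posted bytes once.
seen_by_receiver = double_encoded.decode("utf-8")
print(seen_by_receiver)

# Undoing the extra round trip recovers the original text.
repaired = seen_by_receiver.encode("latin-1").decode("utf-8")
print(repaired == original)
```

The "é" is two bytes in UTF-8, and each of those bytes turns into its own two-byte character after the second encoding, which is why one accented letter shows up as two odd-looking characters.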

Make sure the strings you're getting from MySQL are actually proper
Unicode strings and not the raw UTF-8 encoded bytes.

You may want to have a look at
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html
for the proper options (such as useUnicode and characterEncoding) to use
with your connection.

To check that you're posting actual UTF-8 data to Solr, you can dump
your XML post to a file (don't forget to set the encoding to UTF-8 when
writing it). Then check that the file is readable in any UTF-8 aware
editor.
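
If you prefer a scripted check over an editor, something along these lines could work. It is a rough Python sketch, not a definitive tool: the function names and the "post.xml" file name are hypothetical, and the double-encoding test is only a heuristic that assumes the bytes were misread as Latin-1 (text mangled through another code page, such as Windows-1252, may go undetected).

```python
def diagnose_bytes(data: bytes) -> str:
    """Classify raw bytes as invalid UTF-8, valid UTF-8, or likely double-encoded."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"

    # Heuristic: if the text survives a Latin-1 -> UTF-8 round trip and
    # comes out different, it was probably encoded as UTF-8 twice.
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "valid UTF-8"

    if repaired != text:
        return "possibly double-encoded"
    return "valid UTF-8"


def diagnose_file(path: str) -> str:
    """Read a dumped post (hypothetical file, e.g. 'post.xml') and classify it."""
    with open(path, "rb") as handle:
        return diagnose_bytes(handle.read())
```

Calling diagnose_file("post.xml") on the dumped post then returns one of the three labels. A "possibly double-encoded" result points at the step between MySQL and the post rather than at Solr or Tomcat.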

Cheers,

Jerome.


On Tue, Oct 21, 2008 at 10:43 AM, sunnyfr wrote:
>
> Hi,
>
> I have Solr 1.3 and tomcat55.
> When I index a bit of data and request everything, my accents and
> UTF-8 encoding are obviously not taken into consideration.
> <doc>
> <date name="created">2006-12-14T15:28:27Z</date>
> <str name="description_ja">
> Le 1er film de Goro Miyazaki (fils de Hayao)
> <br />je suis allÃ(c)e  ...
> ....
> <str name="title_ja">渡邊 å‰ å·  vs 三ç"°ä¸‹ç"° 1</str>
>
>
> My MySQL database is in UTF-8; if I query the data manually from MySQL, I
> get accents and even Japanese characters properly.
>
> I index my data; my data-config is:
>  <dataSource type="JdbcDataSource"
>              driver="com.mysql.jdbc.Driver"
>              url="jdbc:mysql://master-spare.videos.com/videos"
>              user="solr"
>              password="pass"
>              batchSize="-1"
>              responseBuffering="adaptive"/>
>
> My schema config file starts with: <?xml version="1.0" encoding="UTF-8" ?>
>
> I've added this to my server.xml, because my localhost points to 8180:
>    <Connector port="8180" maxHttpHeaderSize="8192"
>               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
>               enableLookups="false" redirectPort="8443" acceptCount="100"
>               connectionTimeout="20000" disableUploadTimeout="true"
>               URIEncoding="UTF-8" useBodyEncodingForURI="true" />
>
> What can I check?
> I'm using a linux server.
> If I do dpkg-reconfigure -plow locales
> Generating locales...
>  fr_BE.UTF-8... up-to-date
>  fr_CA.UTF-8... up-to-date
>  fr_CH.UTF-8... up-to-date
>  fr_FR.UTF-8... up-to-date
>  fr_LU.UTF-8... up-to-date
> Generation complete.
>
> Would that be a problem? I would say no, but maybe I am missing a package?
>
>
>
> --
> View this message in context: 
> http://www.nabble.com/tomcat55-solr1.3---Indexing-data%2C-doesnt-take-in-consideration-utf8%21-tp20086167p20086167.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Jerome Eteve.

Chat with me live at http://www.eteve.net
