Re: character encoding issue...

T. Kuro Kurosaka Tue, 05 Nov 2013 13:25:09 -0800

It sounds like the characters were mishandled at index build time.
I would use Luke to check whether a character that appears correctly
when you change the output to Shift_JIS is actually
stored as one Unicode character. I bet it's stored as two characters,
each holding the value of what happened to be the high or low
byte of the Shift_JIS character.
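
To make the two-characters-per-character symptom concrete, here is a
minimal Java sketch. It assumes the JDK in use ships the Shift_JIS
charset (standard full JDK builds do, but treat that as an assumption),
and the class and method names are made up for illustration. It only
reproduces the failure mode described above; it is not what Solr or
SolrJ does internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Assumption: the running JDK provides the Shift_JIS charset.
    private static final Charset SJIS = Charset.forName("Shift_JIS");

    // Encode as Shift_JIS, then decode the bytes with the wrong charset
    // (ISO-8859-1), as a misconfigured indexing pipeline might do.
    public static String misdecode(String original) {
        byte[] sjisBytes = original.getBytes(SJIS);
        return new String(sjisBytes, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the raw bytes, then decode them as Shift_JIS.
    // This is roughly what switching the output encoding to Shift_JIS does.
    public static String repair(String garbled) {
        byte[] rawBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, SJIS);
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C"; // two Japanese characters
        String garbled = misdecode(original);
        System.out.println("original length: " + original.length());
        System.out.println("garbled length:  " + garbled.length());
        System.out.println("repairable: " + repair(garbled).equals(original));
    }
}
```

If Luke shows twice as many characters as expected for a Japanese
string, the index probably holds the garbled form, and reindexing with
the right charset is the fix rather than changing the output encoding.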


There are many possible causes of this. If you are indexing
HTML documents from HTTP servers, the HTTP server may
be configured to send the wrong charset= info in the Content-Type
header. If the document comes directly from a file system,
and the document doesn't have a META header declaring
the charset, then the system assumes a default charset,
which is typically ISO-8859-1 or UTF-8, and misinterprets
Shift_JIS encoded characters.
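
As a rough illustration of where that charset decision gets made, the
sketch below pulls the charset= parameter out of a Content-Type value
and falls back to a default when it is missing or unknown. The class
name, method names and the UTF-8 fallback are my own assumptions, not
anything a particular crawler does. Note that it trusts the header, so
a server sending the wrong charset would still produce garbage.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetFromHeader {
    // Matches charset=NAME or charset="NAME" in a Content-Type value.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Return the declared charset, or the fallback if none is declared
    // or the declared name is not supported by this JVM.
    public static Charset resolve(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET_PARAM.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                }
            }
        }
        return fallback;
    }

    // Decode a response body explicitly instead of relying on a platform default.
    public static String decode(byte[] body, String contentType) {
        return new String(body, resolve(contentType, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=Shift_JIS", StandardCharsets.UTF_8));
        System.out.println(resolve("text/html", StandardCharsets.UTF_8));
    }
}
```

Logging the resolved charset for each document is a cheap way to spot
the ones that silently fell back to the default.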

You need to debug to find out where the characters
get corrupted.

On 11/04/2013 11:15 PM, Chris wrote:

Sorry, was away a bit & hence the delay.

I am inserting Java strings into a Java bean class, and then calling the
addBean() method to insert the POJO into Solr.

When I query using either Tomcat or Jetty, I get these special characters. But
I have noticed that if I change the output encoding to "Shift-JIS", those
characters appear as what I think are Japanese characters.

But this solution doesn't work for all special characters, as I can
still see some of them... isn't there an encoding that can cover all
characters, whatever they might be? Any ideas on what I should do?

Regards,
Chris


On Mon, Nov 4, 2013 at 6:27 PM, Erick Erickson <[email protected]> wrote:

The problem is there are about a dozen places where the character
encoding can be mis-configured. The problem you're seeing above
actually looks like a problem with the character set configured in
your browser, it may have nothing to do with what's actually in Solr.

You might write a small SolrJ program and see if you can dump the contents
in binary and examine them to see...
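
A dump along those lines could look like the sketch below. It only
covers the dumping part: in a real SolrJ program the string would be a
field value from a document returned by a query, which is left out
because it depends on your setup. The class and method names are
hypothetical.

```java
public class CodePointDump {
    // Render each Unicode code point of the string as U+XXXX, separated by
    // spaces, so the stored characters can be inspected independently of
    // any browser or terminal encoding.
    public static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a field value fetched from Solr.
        String fieldValue = "A\u65E5";
        System.out.println(dump(fieldValue)); // prints "U+0041 U+65E5"
    }
}
```

If the dump shows U+FFFD (the Unicode replacement character), the
original bytes were already lost before the value was stored. If it
shows pairs of values in the U+0080 to U+00FF range where one CJK
character is expected, the bytes were decoded with the wrong charset.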

Best
Erick


On Sun, Nov 3, 2013 at 6:39 AM, Rajani Maski <[email protected]> wrote:

How are you extracting the text from the website[1] you are
referring to? With Apache Nutch or another crawler? If so, first check
whether that crawler is giving you data in the correct format before
you invoke the Solr index method.

[1]http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/

URI encoding should resolve this problem.




On Fri, Nov 1, 2013 at 10:50 AM, Chris <[email protected]> wrote:

Hi Rajani,

I followed the steps exactly as in

http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/

However, when I send a query to this new instance in Tomcat, I again
get the error -

   <str name="fulltxt">Scheduled Groups Maintenance
In preparation for the new release roll-out,���� Diigo groups won’t be
accessible on Sept 28 (Mon) around midnight 0:00 PST for several
hours.
Stay tuned to say hello to Diigo V4 soon!

location of the text  -
http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/

same problem at - http://cn.nytimes.com/business/20130926/c26alibaba/

All text in title comes like -

������������������������������������ - ���������������������
������������</str>
     <arr name="text">
       <str>������������������������������������ -
��������������������� ������������</str>
     </arr>


Can you please advise?

Chris




On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski <[email protected]> wrote:

Hi,

If you are using Apache Tomcat Server, I hope you are not missing the
configuration below:

  <Connector port="port Number" protocol="HTTP/1.1"
             connectionTimeout="20000"
             redirectPort="8443" URIEncoding="UTF-8"/>

I had faced a similar issue with Chinese characters and resolved it with
the above config.
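
For what it's worth, URIEncoding controls how Tomcat decodes the
percent-encoded bytes in the request URI, and in Tomcat 7 it defaults to
ISO-8859-1. The sketch below imitates that mismatch with the JDK's own
URLEncoder and URLDecoder; it does not involve Tomcat itself, and the
class and method names are made up. It covers query parameters only and
says nothing about how text already stored in the index was decoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UriEncodingDemo {
    // Percent-encode the value as UTF-8 (what a client typically sends), then
    // decode it with the given charset (what the server is configured to use).
    public static String roundTrip(String value, String serverCharset) {
        try {
            String encoded = URLEncoder.encode(value, "UTF-8");
            return URLDecoder.decode(encoded, serverCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + serverCharset, e);
        }
    }

    public static void main(String[] args) {
        String query = "\u4E2D\u6587"; // two Chinese characters
        // Matching charsets: the server sees the original two characters.
        System.out.println(roundTrip(query, "UTF-8").equals(query));
        // Mismatch: each 3-byte UTF-8 sequence becomes three Latin-1 characters.
        System.out.println(roundTrip(query, "ISO-8859-1").length());
    }
}
```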

Links for reference :

http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/

http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8


Thanks



On Tue, Oct 29, 2013 at 9:20 PM, Chris <[email protected]> wrote:

Hi All,

I get characters like -

������������������ - CTA������������ -

in the solr index. I am adding Java beans to solr by the addBean()
function.

This seems to be a character encoding issue. Any pointers on how to
resolve this one?

I have seen that this occurs mostly for Japanese and Chinese
characters.


--
-----------------------------------------
T. "Kuro" Kurosaka • Senior Software Engineer
