Re: character encoding issue...

Chris Sat, 09 Nov 2013 06:21:47 -0800

I have tried a lot of things and am almost at my wit's end :(


Here is the code I used to get the strings -

String htmlContent = readPage(page.getWebURL().getURL());

I even tried -

Document doc = Jsoup.parse(new URL(url).openStream(), "UTF-8", url);
String htmlContent = doc.html();

and

Document doc = Jsoup.parse(htmlContent, "UTF-8");

No improvement so far. Any advice, please?
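
One way to narrow this down is to look at the raw code points of htmlContent before anything is sent to Solr. The following is only a minimal sketch: the class name CodePointDump and the helper dump() are hypothetical, and UTF-8 bytes misread as ISO-8859-1 stand in for whatever mismatch is really happening. If a single CJK character shows up as several small code points, the text was already decoded with the wrong charset at fetch time.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting what a String really contains.
final class CodePointDump {

    /** Returns the code points of s as space-separated U+XXXX values. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u65E5"; // one CJK character
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Simulate the bug: UTF-8 bytes decoded with the wrong charset.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(dump(original)); // U+65E5
        System.out.println(dump(wrong));    // U+00E6 U+0097 U+00A5
    }
}
```

Calling dump(htmlContent.substring(0, 200)) (or on a title string) right after the fetch would show whether the corruption happens before indexing or later.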



function that gets the html ----------------------------------------
public static String readPage(String urlString) {
    try {
        URL url = new URL(urlString);
        DefaultHttpClient client = new DefaultHttpClient();
        client.getParams().setParameter(ClientPNames.COOKIE_POLICY,
                CookiePolicy.BROWSER_COMPATIBILITY);

        HttpGet request = new HttpGet(url.toURI());
        HttpResponse response = client.execute(request);

        if (response.getStatusLine().getStatusCode() == 200
                && response.getEntity().getContentType().toString().contains("text/html")) {
            Reader reader = null;
            try {
                reader = new InputStreamReader(response.getEntity().getContent());

                StringBuffer sb = new StringBuffer();
                int read;
                char[] cbuf = new char[1024];
                while ((read = reader.read(cbuf)) != -1) {
                    sb.append(cbuf, 0, read);
                }
                return sb.toString();
            } finally {
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        } else {
            return "";
        }
    } catch (Exception e) {
        return "";
    }
}

---------------------------------------------------------------------------
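
The InputStreamReader above is created without a charset, so it decodes the response with the JVM's platform default, which may be one place where the text gets corrupted. The following is a minimal sketch, not a definitive fix: it uses only the JDK, the class CharsetAwareDecoder and its methods are hypothetical names, and the fallback to UTF-8 when the header names no charset is an assumption. It picks the charset from a Content-Type value and decodes the raw bytes with it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: decode a response body using the charset
// declared in its Content-Type header instead of the platform default.
final class CharsetAwareDecoder {

    /** Returns the charset named in contentType, or UTF-8 (assumed default). */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                            .trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the fallback.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes body with the charset declared in contentType. */
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }

    public static void main(String[] args) {
        byte[] body = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(body, "text/html; charset=ISO-8859-1"));
    }
}
```

In readPage() the same idea would mean passing an explicit charset as the second argument to the InputStreamReader constructor. A page can also declare its charset only in a META tag, which this sketch does not look at.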



On Wed, Nov 6, 2013 at 2:53 AM, T. Kuro Kurosaka <k...@healthline.com> wrote:

> It sounds like the characters were mishandled at index build time.
> I would use Luke to see if a character that appears correctly
> when you change the output to SHIFT JIS is actually
> stored as one Unicode character. I bet it's stored as two characters,
> each having the value of what happened to be the high and low
> bytes of the SHIFT JIS character.
>
> There are many possible causes of this. If you are indexing
> HTML documents from HTTP servers, the HTTP server may
> be configured to send the wrong charset= info in the Content-Type
> header. If the document comes directly from a file system,
> and the document doesn't have a META header declaring
> the charset, then the system assumes a default charset,
> which is typically ISO-8859-1 or UTF-8, and misinterprets
> the SHIFT JIS encoded characters.
>
> You need to debug to find out where the characters
> get corrupted.
>
>
> On 11/04/2013 11:15 PM, Chris wrote:
>
>> Sorry, was away a bit & hence the delay.
>>
>> I am inserting Java strings into a Java bean class, and then calling the
>> addBean() method to insert the POJO into Solr.
>>
>> When I query using either Tomcat or Jetty, I get these special characters.
>> But I have noted that if I change the output to "Shift-JIS" encoding, those
>> characters appear as what I think are Japanese characters.
>>
>> But this solution doesn't work for all special characters, as I can
>> still see some of them. Isn't there an encoding that can cover all the
>> characters, whatever they might be? Any ideas on what I should do?
>>
>> Regards,
>> Chris
>>
>>
>> On Mon, Nov 4, 2013 at 6:27 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> The problem is that there are about a dozen places where the character
>>> encoding can be misconfigured. The problem you're seeing above
>>> actually looks like a problem with the character set configured in
>>> your browser; it may have nothing to do with what's actually in Solr.
>>>
>>> You might write a small SolrJ program and see if you can dump the contents
>>> in binary and examine them.
>>>
>>> Best
>>> Erick
>>>
>>>
>>> On Sun, Nov 3, 2013 at 6:39 AM, Rajani Maski <rajinima...@gmail.com>
>>> wrote:
>>>
>>>> How are you extracting the text from the website [1] you are
>>>> referring to? Apache Nutch or some other crawler? If so, first check
>>>> whether the crawler is giving you data in the correct format before you
>>>> invoke the Solr index method.
>>>>
>>>> [1]http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
>>>>
>>>> URI encoding should resolve this problem.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Nov 1, 2013 at 10:50 AM, Chris <christu...@gmail.com> wrote:
>>>>
>>>>> Hi Rajani,
>>>>>
>>>>> I followed the steps exactly as in
>>>>>
>>>>> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
>>>>>
>>>>> However, when I send a query to this new instance in Tomcat, I again
>>>>> get the error -
>>>>>
>>>>>    <str name="fulltxt">Scheduled Groups Maintenance
>>>>> In preparation for the new release roll-out,���� Diigo groups won’t be
>>>>> accessible on Sept 28 (Mon) around midnight 0:00 PST for several
>>>>> hours.
>>>>> Stay tuned to say hello to Diigo V4 soon!
>>>>>
>>>>> location of the text  -
>>>>> http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
>>>>>
>>>>> same problem at - http://cn.nytimes.com/business/20130926/c26alibaba/
>>>>>
>>>>> All text in title comes like -
>>>>>
>>>>> ������������������������������������ - ���������������������
>>>>> ������������</str>
>>>>>      <arr name="text">
>>>>>        <str>������������������������������������ -
>>>>> ��������������������� ������������</str>
>>>>>      </arr>
>>>>>
>>>>>
>>>>> Can you please advise?
>>>>>
>>>>> Chris
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski <rajinima...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> If you are using Apache Tomcat Server, I hope you are not missing
>>>>>> the configuration below:
>>>>>>
>>>>>>   <Connector port="port Number" protocol="HTTP/1.1"
>>>>>>              connectionTimeout="20000"
>>>>>>              redirectPort="8443" URIEncoding="UTF-8"/>
>>>>>>
>>>>>> I faced a similar issue with Chinese characters and resolved it
>>>>>> with the above config.
>>>>>>
>>>>>> Links for reference :
>>>>>>
>>>>>>
>>>>>> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
>>>>>>
>>>>>> http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 29, 2013 at 9:20 PM, Chris <christu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I get characters like -
>>>>>>>
>>>>>>> ������������������ - CTA������������ -
>>>>>>>
>>>>>>> in the solr index. I am adding Java beans to solr by the addBean()
>>>>>>> function.
>>>>>>>
>>>>>>> This seems to be a character encoding issue. Any pointers on how to
>>>>>>> resolve this one?
>>>>>>>
>>>>>>> I have seen that this occurs mostly for Japanese and Chinese
>>>>>>> characters.
>
> --
> -----------------------------------------
> T. "Kuro" Kurosaka • Senior Software Engineer
>
>
