I have tried a lot of things and am almost at my wit's end :(
Here is the code I used to get the strings -

    String htmlContent = readPage(page.getWebURL().getURL());

I even tried -

    Document doc = Jsoup.parse(new URL(url).openStream(), "UTF-8", url);
    String htmlContent = doc.html();

and

    Document doc = Jsoup.parse(htmlContent, "UTF-8");

No improvement so far. Any advice for me, please?

function that gets the html
----------------------------------------
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.params.ClientPNames;
import org.apache.http.client.params.CookiePolicy;
import org.apache.http.impl.client.DefaultHttpClient;

public static String readPage(String urlString) {
    try {
        URL url = new URL(urlString);
        DefaultHttpClient client = new DefaultHttpClient();
        client.getParams().setParameter(ClientPNames.COOKIE_POLICY,
                CookiePolicy.BROWSER_COMPATIBILITY);
        HttpGet request = new HttpGet(url.toURI());
        HttpResponse response = client.execute(request);

        // Only read the body of a 200 response with an HTML content type
        if (response.getStatusLine().getStatusCode() == 200
                && response.getEntity().getContentType().toString().contains("text/html")) {
            Reader reader = null;
            try {
                // No charset is passed, so the bytes are decoded with the platform default
                reader = new InputStreamReader(response.getEntity().getContent());
                StringBuffer sb = new StringBuffer();
                int read;
                char[] cbuf = new char[1024];
                while ((read = reader.read(cbuf)) != -1) {
                    sb.append(cbuf, 0, read);
                }
                return sb.toString();
            } finally {
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        } else {
            return "";
        }
    } catch (Exception e) {
        // Any failure (bad URL, network error, missing header) yields an empty string
        return "";
    }
}
---------------------------------------------------------------------------

On Wed, Nov 6, 2013 at 2:53 AM, T. Kuro Kurosaka <k...@healthline.com> wrote:

> It sounds like the characters were mishandled at index build time.
> I would use Luke to see if a character that appears correctly
> when you change the output to be SHIFT JIS is actually
> stored as one Unicode character. I bet it's stored as two characters,
> each having the value of what happened
> to be the high and low bytes of the SHIFT JIS character.
>
> There are many possible causes of this. If you are indexing
> HTML documents from HTTP servers, the HTTP server may
> be configured to send the wrong charset= info in the Content-Type
> header.
> If the document comes directly from a file system,
> and if the document doesn't have a META header declaring
> the charset, then the system assumes a default charset,
> which is typically ISO-8859-1 or UTF-8, and misinterprets
> the SHIFT-JIS encoded characters.
>
> You need to debug to find out where the characters
> get corrupted.
>
> On 11/04/2013 11:15 PM, Chris wrote:
>
>> Sorry, I was away for a bit, hence the delay.
>>
>> I am inserting Java strings into a Java bean class, and then calling the
>> addBean() method to insert the POJO into Solr.
>>
>> When I query using either Tomcat or Jetty, I get these special characters.
>> But I have noticed that if I change the output to "Shift-JIS" encoding, then
>> those characters appear as what I think are Japanese characters.
>>
>> But this solution doesn't work for all special characters, as I can
>> still see some of them. Isn't there an encoding that can cover all the
>> characters, whatever they might be? Any ideas on what I should do?
>>
>> Regards,
>> Chris
>>
>> On Mon, Nov 4, 2013 at 6:27 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> The problem is that there are about a dozen places where the character
>>> encoding can be misconfigured. The problem you're seeing above
>>> actually looks like a problem with the character set configured in
>>> your browser; it may have nothing to do with what's actually in Solr.
>>>
>>> You might write a small SolrJ program and see if you can dump the contents
>>> in binary and examine them to see...
>>>
>>> Best
>>> Erick
>>>
>>> On Sun, Nov 3, 2013 at 6:39 AM, Rajani Maski <rajinima...@gmail.com>
>>> wrote:
>>>
>>>> How are you extracting the text from the website [1] you are
>>>> referring to? With Apache Nutch or some other crawler? If so, first check
>>>> whether that crawler engine is giving you data in the correct format before
>>>> you invoke the Solr index method.
>>>>
>>>> [1] http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
>>>>
>>>> URI encoding should resolve this problem.
>>>>
>>>> On Fri, Nov 1, 2013 at 10:50 AM, Chris <christu...@gmail.com> wrote:
>>>>
>>>>> Hi Rajani,
>>>>>
>>>>> I followed the steps exactly as in
>>>>>
>>>>> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
>>>>>
>>>>> However, when I send a query to this new instance in Tomcat, I again
>>>>> get the error -
>>>>>
>>>>> <str name="fulltxt">Scheduled Groups Maintenance
>>>>> In preparation for the new release roll-out,���� Diigo groups won’t be
>>>>> accessible on Sept 28 (Mon) around midnight 0:00 PST for several
>>>>> hours.
>>>>> Stay tuned to say hello to Diigo V4 soon!
>>>>>
>>>>> Location of the text -
>>>>> http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
>>>>>
>>>>> Same problem at - http://cn.nytimes.com/business/20130926/c26alibaba/
>>>>>
>>>>> All text in the title comes out like -
>>>>>
>>>>> ������������������������������������ - ���������������������
>>>>> ������������</str>
>>>>> <arr name="text">
>>>>> <str>������������������������������������ -
>>>>> ��������������������� ������������</str>
>>>>> </arr>
>>>>>
>>>>> Can you please advise?
>>>>>
>>>>> Chris
>>>>>
>>>>> On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski <rajinima...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> If you are using Apache Tomcat Server, I hope you are not missing
>>>>>> the configuration below:
>>>>>>
>>>>>> <Connector port="port Number" protocol="HTTP/1.1"
>>>>>> connectionTimeout="20000"
>>>>>> redirectPort="8443" URIEncoding="UTF-8"/>
>>>>>>
>>>>>> I faced a similar issue with Chinese characters and resolved it
>>>>>> with the above config.
>>>>>>
>>>>>> Links for reference:
>>>>>>
>>>>>> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
>>>>>>
>>>>>> http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Tue, Oct 29, 2013 at 9:20 PM, Chris <christu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I get characters like -
>>>>>>>
>>>>>>> ������������������ - CTA������������ -
>>>>>>>
>>>>>>> in the Solr index. I am adding Java beans to Solr with the addBean()
>>>>>>> function.
>>>>>>>
>>>>>>> This seems to be a character encoding issue. Any pointers on how to
>>>>>>> resolve this one?
>>>>>>>
>>>>>>> I have seen that this occurs mostly for Japanese and Chinese
>>>>>>> characters.
>
> --
> -----------------------------------------
> T. "Kuro" Kurosaka • Senior Software Engineer
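
The quoted replies suggest checking whether each character is stored as one Unicode character or as several byte-valued ones, and dumping the contents to find where the corruption happens. A minimal sketch of that check in plain Java follows. The class name EncodingCheck, both method names and the two-character sample string are made up for illustration, and the sample bytes only stand in for a fetched page. It decodes the same bytes once with the right charset and once with the wrong one, then prints the code points so the result does not depend on the console or browser encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr, SolrJ, Jsoup or HttpClient.
final class EncodingCheck {

    /** Decodes raw bytes with an explicitly named charset instead of the platform default. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    /** Renders every code point as U+XXXX, separated by single spaces. */
    static String dumpCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumption: two Japanese characters encoded as UTF-8 stand in for a page body.
        byte[] raw = "\u65e5\u672c".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the bytes were written in: one code point per character.
        System.out.println(dumpCodePoints(decode(raw, StandardCharsets.UTF_8)));

        // Decoded with the wrong charset: every byte turns into its own character.
        System.out.println(dumpCodePoints(decode(raw, StandardCharsets.ISO_8859_1)));
    }
}
```

With this sample the correctly decoded string has two code points, while the ISO-8859-1 decoding has six, one per byte. Running dumpCodePoints on a string just before it goes into the bean would show whether it is already corrupted at that stage. If it is, one thing to try in readPage is passing a charset to the reader, for example new InputStreamReader(stream, StandardCharsets.UTF_8). Which charset a given page really uses is an assumption that has to be checked against its Content-Type header or META tag.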