Hi All,

I'm trying to automate the process of posting XML documents to Solr using SolrJ. Essentially, I extract the text from a given URL, create a Solr document from it, and post it using the following function:

    public void postToSolrUsingSolrj(String rawText, String pageId) {
        String url = "http://localhost:8983/solr";
        CommonsHttpSolrServer server;

        try {
            // Get connection to Solr server
            server = new CommonsHttpSolrServer(url);

            // Set XMLResponseParser : Reqd for older version of Solr 1.3
            server.setParser(new XMLResponseParser());

            server.setSoTimeout(1000); // socket read timeout
            server.setConnectionTimeout(100);
            server.setDefaultMaxConnectionsPerHost(100);
            server.setMaxTotalConnections(100);
            server.setFollowRedirects(false); // defaults to false

            // allowCompression defaults to false.
            // Server side must support gzip or deflate for this to have any effect.
            server.setAllowCompression(true);
            server.setMaxRetries(1); // defaults to 0. > 1 not recommended.

            // WARNING : this will delete all pre-existing Solr index
            //server.deleteByQuery( "*:*" );// delete everything!

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pageId);
            doc.addField("features", rawText);

            // Add the docs to Solr Server
            server.add(doc);

            // Do commit the changes
            server.commit();
        } catch (Exception e) {}
    }

In the above, the parameter rawText is the HTML stripped of all its tags, JS, CSS, etc., and pageId is the URL of that page. This works perfectly fine for English pages, but the problem comes up when I try to index non-English pages. For pages in Tamil, say, the Unicode/UTF-8 encoding seems to cause a problem: after indexing such pages, when I search for them from the Solr admin search interface, I get the result, but the content is not shown in Tamil. Instead it displays just some characters, I think in Unicode. The same thing worked fine for pages in English.

Next, I extracted the raw text from that HTML page and manually created an XML file like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <add>
      <doc>
        <field name="id">UTF2TEST</field>
        <field name="name">Test with some UTF-8 encoded characters</field>
        <field name="features">*some tamil unicode text here*</field>
      </doc>
    </add>

and posted it from the command line using the post.jar file. Searching now gives me the result, and unlike last time the browser shows the indexed text in Tamil itself and not the raw Unicode. So this clearly shows that the string I'm using to create the Solr document has some encoding issue, right? Or is it something else?

I also tried doing something like this:

    // Encode in Unicode UTF-8
    utfEncodedText = new String(rawText.getBytes("UTF-8"));

but even this didn't help either. It seems to be some silly problem somewhere that I'm not able to catch. :-)

I would appreciate it if someone could point me to the bug...

Thanks,
Ahmed.
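
A note on the getBytes attempt above, offered as a minimal sketch rather than a diagnosis of this particular setup. A Java String has no encoding of its own. Encoding it to UTF-8 bytes and decoding those bytes again either gives back an equal string or, if the decode uses a different charset (new String(bytes) with no charset argument uses the JVM default), makes things worse. It cannot repair text that was decoded with the wrong charset when the page was first read. The class name, method names and sample text below are hypothetical and only illustrate that behaviour. They assume the page bytes are UTF-8 and that the wrong charset is ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    // Encode to UTF-8 and decode with the same charset: the result equals the input.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Simulates a fetch step that decodes UTF-8 page bytes with the wrong charset
    // (ISO-8859-1 is an assumption chosen for illustration).
    static String decodeWrongly(byte[] pageBytes) {
        return new String(pageBytes, StandardCharsets.ISO_8859_1);
    }

    // Decodes the page bytes with the charset they were actually written in.
    static String decodeCorrectly(byte[] pageBytes) {
        return new String(pageBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Five Tamil code points, written as escapes so the source file's
        // own encoding does not matter.
        String tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD";
        byte[] pageBytes = tamil.getBytes(StandardCharsets.UTF_8);

        System.out.println(roundTrip(tamil).equals(tamil));           // true
        System.out.println(decodeCorrectly(pageBytes).equals(tamil)); // true
        System.out.println(decodeWrongly(pageBytes).equals(tamil));   // false
        System.out.println(decodeWrongly(pageBytes).length());        // 15, one char per byte
    }
}
```

If rawText already looks like the decodeWrongly output before it reaches doc.addField, the place to look would be the code that reads the page, not the SolrJ call. If rawText is correct at that point, the HTTP request or the servlet container's charset handling would be the next thing to check.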