Also, your browser may be using the platform's default encoding instead of UTF-8. Some browsers on Mac OS and Windows have this problem.
Tomcat sometimes needs adjustment to use UTF-8. If you are on tomcat, check this:
http://find.searchhub.org/link?url=http://wiki.apache.org/solr/SolrTomcat
http://find.searchhub.org/?q=utf-8#%2Fp%3Asolr%2Fs%3Alucid%2Cwiki

----- Original Message -----
| From: "Gora Mohanty" <g...@mimirtech.com>
| To: solr-user@lucene.apache.org
| Sent: Thursday, September 6, 2012 7:13:40 PM
| Subject: Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000'
| in Solr 4.0
|
| On 7 September 2012 06:24, kiran chitturi <chitturikira...@gmail.com>
| wrote:
| [...]
|
| > When i index a text field which has arabic and English like this
| > tweet
| > “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار
| > الكرافته ؟؟”
| > #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا
| > with field_type as 'text_ar' and when i try to see the same field
| > again in
| > solr, it is shown as below.
| > RT @AhmedWagih: لو معملناش ØØ§Ø¬Ø© Ù�ÙŠ الزيادة
| > السكانية Ù�ÙŠ مصر، هنتØÙˆÙ„ لدولة Ù�قيرة
| > كثيÙ�Ø© السكان زي بنجلادش #Egypt #EgyEconomy
| >
| > both of the lines do not mean the same, but i have just placed them
| > here as
| > an example. This was the problem i am facing.
| >
| [...]
|
| The encoding of your input text is being mangled at some point.
| Presuming that your original encoding is UTF-8, I would look at
| how you are indexing into Solr, and the encoding settings on the
| Java container. Solr itself handles UTF-8 perfectly fine, as do
| most Java containers if configured properly, so my first suspicion
| would be the indexing code.
|
| As it looks like you are pulling from mysql using DIH, check that
| the database character set is UTF-8, and that the connection uses
| UTF-8.
|
| Regards,
| Gora
|
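
For what it's worth, here is a rough Java sketch of how this kind of garbling can arise. It assumes the text starts out as correct UTF-8 and that some layer (database connection, container, or browser) decodes those bytes with a single-byte charset. ISO-8859-1 is only a stand-in here; the charset actually involved in the quoted case is not known, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Solr, Tomcat, or any driver.
class MojibakeDemo {

    // Decode correct UTF-8 bytes with the wrong charset, as a misconfigured
    // layer might. ISO-8859-1 is an assumed stand-in for that wrong charset.
    static String mangle(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake. This only works because ISO-8859-1 maps every byte
    // to exactly one char, so no information was lost on the way in.
    static String repair(String mangled) {
        byte[] raw = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u0641\u064A"; // two Arabic letters, four bytes in UTF-8
        String mangled = mangle(original);

        // Each UTF-8 byte has become its own character.
        System.out.println("chars before: " + original.length());
        System.out.println("chars after:  " + mangled.length());
        System.out.println("round trip ok: " + repair(mangled).equals(original));
    }
}
```

If the text stored in Solr turns out to be repairable this way, that suggests the bytes were fine and only the decoding step was wrong, which points at the connection or container settings rather than the source data. Once a replacement character has been written into the stored text, the original bytes are gone and re-indexing from the source is the only fix.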