Shawn - Stupid coding error in my java code. Used default charset. Changed to UTF-8 and problem fixed.
Thanks again!

-----Original Message-----
From: Tarala, Magesh
Sent: Wednesday, July 08, 2015 8:11 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Encoding Issue?

Wow, that makes total sense. Thanks Shawn!! I'll go down this path.

Thanks,
Magesh

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, July 08, 2015 7:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Encoding Issue?

On 7/8/2015 6:09 PM, Tarala, Magesh wrote:
> I believe the issue is in solr. The character “à” is getting stored in solr
> as “Ã ”. Notice the space after Ã.
>
> I'm using solrj to ingest the documents into solr. So, one of those could be
> the culprit?

Solr accepts and outputs text in UTF-8. The UTF-8 hex encoding for the à character is C3A0. In the latin1 character set, hex C3 is the Ã character. Similarly, in latin1, hex A0 is a non-breaking space.

So it sounds like your input is encoded as UTF-8, therefore that character in your input source is hex c3a0, but something in your indexing process is incorrectly interpreting the UTF-8 representation as latin1, so it sees it as "Ã ". SolrJ is faithfully converting that input to UTF-8 and sending it to Solr.

Thanks,
Shawn
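
For anyone who finds this thread later, here is a minimal Java sketch of the byte-level effect Shawn describes, and of the kind of fix mentioned at the top of the thread (naming the charset explicitly instead of relying on the default). The class and method names (CharsetDemo, decodeAsLatin1, decodeAsUtf8, codePoints) are made up for illustration. The actual ingestion code was never posted, so this is an assumption about what the bug looked like, not a reproduction of it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from the original poster's code.
class CharsetDemo {

    // Wrong: treats UTF-8 bytes as latin1, so each byte becomes its own character.
    static String decodeAsLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Right: names the charset explicitly rather than using the platform default.
    static String decodeAsUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Renders a string as code points so the result does not depend on console encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two bytes C3 A0 are the UTF-8 encoding of the single character U+00E0.
        byte[] input = {(byte) 0xC3, (byte) 0xA0};

        // Prints "U+00C3 U+00A0": A-tilde followed by a non-breaking space.
        System.out.println(codePoints(decodeAsLatin1(input)));

        // Prints "U+00E0": a-grave, the character that was intended.
        System.out.println(codePoints(decodeAsUtf8(input)));
    }
}
```

The same idea applies when reading from a stream or file: wrapping the stream in new InputStreamReader(in, StandardCharsets.UTF_8) avoids depending on whatever the platform default charset happens to be.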