RE: Solr Encoding Issue?

Tarala, Magesh Wed, 08 Jul 2015 18:12:00 -0700

Wow, that makes total sense. Thanks Shawn!! 

I'll go down this path.

Thanks,
Magesh

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Wednesday, July 08, 2015 7:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Encoding Issue?

On 7/8/2015 6:09 PM, Tarala, Magesh wrote:
> I believe the issue is in solr. The character “à” is getting stored in solr 
> as “Ã ”. Notice the space after Ã.
>
> I'm using solrj to ingest the documents into solr. So, one of those could be 
> the culprit?

Solr accepts and outputs text in UTF-8.  The UTF-8 hex encoding for the à 
character is C3A0.

In the latin1 character set, hex C3 is the Ã character.  Similarly, in latin1, 
hex A0 is a non-breaking space.

So it sounds like your input is encoded as UTF-8, which means that character in 
your input source is hex C3A0. Something in your indexing process is 
incorrectly interpreting the UTF-8 bytes as latin1, so it sees two characters: 
"Ã" followed by a non-breaking space.

SolrJ is then faithfully converting that already-garbled input to UTF-8 and 
sending it to Solr.
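
To make the byte-level reasoning above concrete, here is a minimal Java sketch 
that reproduces the symptom. The class and method names (MojibakeDemo, utf8Hex, 
misdecodeAsLatin1, repair) are made up for illustration and are not part of 
Solr or SolrJ. It only simulates the suspected mistake; it does not identify 
where in the indexing process the wrong charset is actually being applied.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Returns the UTF-8 encoding of a string as uppercase hex, e.g. "C3A0".
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulates the suspected bug: bytes that are really UTF-8
    // get decoded as if they were latin1 (ISO-8859-1).
    static String misdecodeAsLatin1(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake: re-encode as latin1 to recover the original
    // bytes, then decode those bytes as UTF-8.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e0";                   // à
        String garbled = misdecodeAsLatin1(original); // Ã + non-breaking space

        System.out.println(utf8Hex(original));                   // C3A0
        System.out.println(garbled.equals("\u00c3\u00a0"));      // true
        System.out.println(repair(garbled).equals(original));    // true
    }
}
```

If your indexing code reads the source with something like an InputStreamReader 
or new String(bytes) without naming a charset, that is a likely place to look, 
since those fall back to the platform default. The repair method is only a 
diagnostic aid; the real fix is to decode the input as UTF-8 in the first place.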

Thanks,
Shawn
