On 5/20/2013 11:24 AM, jignesh wrote:
<response><lst name="responseHeader"><int name="status">0</int><int
name="QTime">1</int></lst><lst name="terms"><lst name="name"><int
name="a">716</int><int name="alt">509</int><int name="aacute">384</int><int
name="as">260</int><int name="amp">176</int><int name="al">95</int><int
name="azul">67</int><int name="ahumado">61</int><int name="and">60</int><int
name="acute">53</int></lst></lst></response>
Solr is indexing the entity-encoded text exactly as it appears in your
XML, so you are getting amp, acute, aacute, and similar terms in your index.
Looking at the XML that you are indexing, it doesn't contain XML encoded
accented characters. It contains XML encoding of HTML encoding. As a
specific example, your XML file contains this:
&amp;eacute;
The correct way to encode this would be the following:
&eacute;
There is a problem with this, however. This is HTML encoding, not XML
encoding. This fails when you try to index it in Solr:
Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general
entity "eacute"
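
To illustrate the three cases, here is a minimal Python sketch. It uses Python's built-in XML parser as a stand-in for the Woodstox parser that Solr uses, so the exact error text differs from the exception above; the field_text helper and the sample strings are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def field_text(xml_snippet):
    """Parse a one-element XML snippet and return its text content."""
    return ET.fromstring(xml_snippet).text


# Hypothetical sample values, not taken from the actual data file.
samples = {
    "double-encoded": "<field>caf&amp;eacute;</field>",
    "HTML entity": "<field>caf&eacute;</field>",
    "literal character": "<field>café</field>",
}

results = {}
for label, snippet in samples.items():
    try:
        results[label] = field_text(snippet)
    except ET.ParseError as exc:
        # An HTML-only entity is not declared in plain XML, so parsing fails.
        results[label] = "parse error: %s" % exc
    print(label, "->", results[label])
```

The double-encoded form parses without error, but the text handed to the indexer is the string "caf&eacute;" rather than the accented word, which is how entity names end up as terms.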
If I put the accented character itself right in the XML, without any XML
or HTML encoding, it indexes correctly.
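
One possible way to clean the data before it reaches Solr is sketched below in Python. This is only an assumption about how your pipeline could look: decode_entities and build_add_xml are hypothetical helper names, the "id" and "name" fields are guesses, and the two decoding passes assume the text was HTML-encoded and then XML-escaped once more.

```python
import html
import xml.etree.ElementTree as ET


def decode_entities(text, passes=2):
    """Undo HTML entity encoding, by default twice for double-encoded text.

    Caveat: a second pass would also decode text that legitimately
    contains something like "&amp;lt;", so adjust passes to your data.
    """
    for _ in range(passes):
        text = html.unescape(text)
    return text


def build_add_xml(doc_id, name):
    """Build a Solr-style <add> document (field names are assumptions)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="name").text = name
    # encoding="unicode" returns a str and keeps accented characters
    # literal, while still escaping XML specials such as "&".
    return ET.tostring(add, encoding="unicode")


clean = decode_entities("caf&amp;eacute; &amp;amp; azul")
xml_out = build_add_xml("1", clean)
print(clean)
print(xml_out)
```

Letting an XML library serialize the document means only the characters XML requires get escaped, and the accented characters go through as-is. The file then needs to be sent to Solr as UTF-8.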
Thanks,
Shawn