Okay. I should have realized from the original email. The input is
XML-encoded HTML. That's fine for a stored field that will be retrieved and
then displayed in a browser, but it is NOT searchable.
What you will have to do is maintain two copies of that data: one stored as
HTML (the one you provided) for display only, not for querying, and a copy
that is stripped of HTML, with the entity codes converted to the proper
Unicode accented characters.
One approach:
1. Put the original text (HTML with entities for accented characters) in a
field named "features_html". This would be a stored="true" indexed="false"
field.
2. Add a copyField from "features_html" to "features".
3. Add an HTML strip char filter to the index analyzer for "features":
   <charFilter class="solr.HTMLStripCharFilterFactory"/>
   See:
   http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterFactory.html
4. Make features stored="false" indexed="true".
Or, your input could contain both features_html and features, with your
indexing client stripping the HTML tags and expanding the entities for the
accented characters before sending the document. In that case features
would be stored="true" as well, so you can return it as clean text with
accents.
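As a rough illustration of that second approach, here is a minimal Python
sketch of what the indexing client might do. It uses only the standard
library's html.parser. The helper names (strip_html, build_doc) and the "id"
field are made up for this example, and the input is assumed to be the HTML
as it looks after XML decoding (that is, containing &eacute; and not
&amp;eacute;).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Keeps text nodes and drops tags (hypothetical helper for this sketch)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &eacute; or &#233;
        # into the Unicode character before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def strip_html(features_html):
    """Return plain text with tags removed and entities expanded."""
    parser = _TextExtractor()
    parser.feed(features_html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def build_doc(doc_id, features_html):
    """Build a document carrying both the display copy and the search copy."""
    return {
        "id": doc_id,  # assumed unique key field name
        "features_html": features_html,
        "features": strip_html(features_html),
    }


if __name__ == "__main__":
    print(build_doc("1", "<p>Caf&eacute; <b>azul</b> &amp; ahumado</p>"))
```

Note that adjacent block tags with no whitespace between them (for example
"<li>a</li><li>b</li>") would run together as "ab" here, so a real client may
want to insert a space at tag boundaries.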
Do you really want the HTML in Solr at all? For rich display it is
reasonable, but is that your requirement?
-- Jack Krupansky
-----Original Message-----
From: Shawn Heisey
Sent: Monday, May 20, 2013 1:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Not able to search Spanish word with ascent in solr
On 5/20/2013 11:24 AM, jignesh wrote:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="terms">
    <lst name="name">
      <int name="a">716</int>
      <int name="alt">509</int>
      <int name="aacute">384</int>
      <int name="as">260</int>
      <int name="amp">176</int>
      <int name="al">95</int>
      <int name="azul">67</int>
      <int name="ahumado">61</int>
      <int name="and">60</int>
      <int name="acute">53</int>
    </lst>
  </lst>
</response>
Solr is indexing the encoded XML - so you are getting amp, acute,
aacute, and similar terms in your index.
Looking at the XML that you are indexing, it doesn't contain XML encoded
accented characters. It contains XML encoding of HTML encoding. As a
specific example, your XML file contains this:
&amp;eacute;
The correct way to encode this would be the following:
&eacute;
There is a problem with this, however. This is HTML encoding, not XML
encoding. This fails when you try to index it in Solr:
Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general
entity "eacute"
If I put the accented character right in the XML without the XML or HTML
encoding, it works correctly.
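The same three cases can be reproduced outside Solr. This is only a sketch
using Python's built-in XML parser, not the Woodstox parser that Solr uses,
so the error text differs, but the behaviour is analogous. The field name
and the sample word are made up. The last case, a numeric character
reference, is not something I tested against Solr; it is included because
standard XML accepts it without any entity declaration.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_snippet):
    """Return the element text, or a short error string if parsing fails."""
    try:
        return ET.fromstring(xml_snippet).text
    except ET.ParseError as exc:
        return "ParseError: " + str(exc)


# XML encoding of an HTML entity: parses, but yields the literal text
# "&eacute;", which is what ends up tokenized into terms like "eacute".
double_encoded = '<field name="name">caf&amp;eacute;</field>'

# Bare HTML entity: not declared in XML, so the parser rejects it.
html_entity = '<field name="name">caf&eacute;</field>'

# The accented character itself: parses to the expected text.
literal = '<field name="name">caf\u00e9</field>'

# Numeric character reference: valid XML without any declaration.
numeric = '<field name="name">caf&#233;</field>'

if __name__ == "__main__":
    for label, snippet in [
        ("double encoded", double_encoded),
        ("HTML entity", html_entity),
        ("literal", literal),
        ("numeric", numeric),
    ]:
        print(label, "->", try_parse(snippet))
```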
Thanks,
Shawn