Okay. I should have realized from the original email. The input is XML-encoded HTML. That's fine for a stored field that will be retrieved and then displayed in a browser, but it is NOT searchable as-is.

What you will have to do is maintain two copies of that data: one stored as HTML (the one you provided) for display only, not for querying, and one that is stripped of HTML, with the entity codes converted to proper Unicode accented characters.

One approach:

1. Put the original text (HTML with entities for accented characters) in a field named "features_html". This would be a stored="true" indexed="false" field.
2. Add a copyField from "features_html" to "features".
3. Add an HTML strip char filter to the index analyzer for "features".

<charFilter class="solr.HTMLStripCharFilterFactory"/>

See:
http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterFactory.html

4. Make features stored="false" indexed="true".

Alternatively, your input could contain both features_html and features, with your indexing client stripping the HTML tags and expanding the entities into accented characters before it sends the document. In that case features can also be stored, so you can return it as clean text with accents.
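
A rough sketch of that client-side approach in Python, using only the standard library. The names strip_html and build_doc are made up for illustration, and the "id" field is an assumption about your schema; this is not something Solr provides, just one way the indexing client could prepare both fields.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and drops tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser expand entities such as
        # &eacute; into the corresponding Unicode characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not run
        # together, then collapse repeated whitespace. Note that a tag in
        # the middle of a word would split that word.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html_text):
    """Return html_text without tags and with entities expanded."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.text()


def build_doc(doc_id, features_html):
    """Build a document dict holding both the display and search copies."""
    return {
        "id": doc_id,                            # assumed unique key field
        "features_html": features_html,          # original HTML, display only
        "features": strip_html(features_html),   # clean text for searching
    }
```

The resulting dict can then be sent to Solr by whatever client you already use; how it is serialized is up to that client.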

Do you really want the HTML in Solr at all? For rich display it is reasonable, but is that your requirement?

-- Jack Krupansky

-----Original Message-----
From: Shawn Heisey
Sent: Monday, May 20, 2013 1:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Not able to search Spanish word with ascent in solr

On 5/20/2013 11:24 AM, jignesh wrote:
<response><lst name="responseHeader"><int name="status">0</int><int
name="QTime">1</int></lst><lst name="terms"><lst name="name"><int
name="a">716</int><int name="alt">509</int><int name="aacute">384</int><int
name="as">260</int><int name="amp">176</int><int name="al">95</int><int
name="azul">67</int><int name="ahumado">61</int><int name="and">60</int><int
name="acute">53</int></lst></lst></response>

Solr is indexing the encoded XML - so you are getting amp, acute,
aacute, and similar terms in your index.

Looking at the XML that you are indexing, it doesn't contain XML encoded
accented characters.  It contains XML encoding of HTML encoding.  As a
specific example, your XML file contains this:

&amp;eacute;

The correct way to encode this would be the following:

&eacute;

There is a problem with this, however.  This is HTML encoding, not XML
encoding.  This fails when you try to index it in Solr:

Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general
entity "eacute"

If I put the accented character right in the XML without the XML or HTML
encoding, it works correctly.
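
The three cases can be reproduced outside Solr with any XML parser. The following is a small sketch using Python's standard xml.etree.ElementTree, not Solr's Woodstox parser, so the error type differs, but the entity handling follows the same XML rules. The field value "café" and the helper name field_text are made up for illustration. The numeric character reference &#233; in the last case is standard XML and was not part of the test described above.

```python
import xml.etree.ElementTree as ET


def field_text(xml_snippet):
    """Parse a single <field> element and return its text (hypothetical helper)."""
    return ET.fromstring(xml_snippet).text


# Case 1: HTML entity that was XML-encoded again. The XML parser decodes
# &amp; to &, so the field value is the literal text "caf&eacute;", which
# a tokenizer then splits into terms like "caf" and "eacute".
double_encoded = field_text('<field name="name">caf&amp;eacute;</field>')

# Case 2: bare HTML entity. XML does not define "eacute", so parsing fails.
try:
    field_text('<field name="name">caf&eacute;</field>')
    html_entity_ok = True
except ET.ParseError:
    html_entity_ok = False

# Case 3: the accented character itself, with no encoding at all.
literal = field_text('<field name="name">café</field>')

# A numeric character reference is valid XML and yields the same character.
numeric = field_text('<field name="name">caf&#233;</field>')

print(repr(double_encoded), html_entity_ok, repr(literal), repr(numeric))
```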

Thanks,
Shawn
