Re: Not able to search Spanish word with ascent in solr

Jack Krupansky Mon, 20 May 2013 11:10:44 -0700

Okay. I should have realized from the original email. The input is XML-encoded HTML. That's fine for a stored field that will be retrieved and then displayed in a browser, but is NOT searchable.

What you will have to do is maintain two copies of that data: one stored as HTML (the one you provided) for display only, not for querying, and a copy that is stripped of HTML, in which the entity codes are also converted to the proper Unicode accented characters.


One approach:

1. Put the original text (HTML with entities for accented characters) in a field named "features_html". This would be a stored="true" indexed="false" field.

2. Add a copyField from "features_html" to "features".
3. Add an HTML strip char filter to the index analyzer for "features".

<charFilter class="solr.HTMLStripCharFilterFactory"/>

See:
http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterFactory.html

4. Make features stored="false" indexed="true".

Or, your input could contain both features_html and features, and your indexing client would strip the HTML tags and expand the entities for the accented characters. Then you can return features for clean text with accents.
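
For the client-side alternative, here is a minimal Python sketch of the stripping step. It uses only the standard library; the helper names strip_html and build_doc are hypothetical, and the sample markup is made up for illustration. It assumes the HTML has already been XML-decoded once, so it contains &eacute; rather than &amp;eacute;.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; tags are dropped and entities become Unicode."""

    def __init__(self):
        # convert_charrefs=True makes the parser turn &eacute; into é
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join the text nodes as-is, then collapse runs of whitespace.
        # Adjacent block elements with no whitespace between them will
        # run together; add a separator if your markup needs one.
        return " ".join("".join(self._chunks).split())


def strip_html(features_html):
    """Return plain text with HTML tags removed and entities expanded."""
    parser = _TextExtractor()
    parser.feed(features_html)
    parser.close()
    return parser.text()


def build_doc(features_html):
    """Hypothetical helper: build both field values for one document."""
    return {
        "features_html": features_html,          # display copy
        "features": strip_html(features_html),   # searchable copy
    }


if __name__ == "__main__":
    print(build_doc("<p>Jam&oacute;n <b>ahumado</b></p>"))
```

The resulting dictionary would then be sent to Solr by whatever client you already use; that part is not shown here.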

Do you really want the HTML in Solr at all? For rich display it is reasonable, but is that your requirement?


-- Jack Krupansky

-----Original Message-----
From: Shawn Heisey

Sent: Monday, May 20, 2013 1:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Not able to search Spanish word with ascent in solr

On 5/20/2013 11:24 AM, jignesh wrote:

<response><lst name="responseHeader"><int name="status">0</int><int
name="QTime">1</int></lst><lst name="terms"><lst name="name"><int
name="a">716</int><int name="alt">509</int><int name="aacute">384</int><int
name="as">260</int><int name="amp">176</int><int name="al">95</int><int
name="azul">67</int><int name="ahumado">61</int><int name="and">60</int><int
name="acute">53</int></lst></lst></response>


Solr is indexing the encoded XML - so you are getting amp, acute,
aacute, and similar terms in your index.
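
To see why those terms appear, here is a rough Python illustration. The regular expression is only a crude stand-in for Solr's tokenizer (an assumption, not the real analysis chain), and the sample word is made up, but the effect is the same: the entity name is split out as its own terms and the word around it is broken apart.

```python
import re


def rough_terms(text):
    """Crude approximation of tokenizing: lowercase, keep letter/digit runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


if __name__ == "__main__":
    # "rápido" as it looks after one round of XML decoding is "r&aacute;pido";
    # in the double-encoded input it arrives as "r&amp;aacute;pido".
    print(rough_terms("r&amp;aacute;pido"))  # ['r', 'amp', 'aacute', 'pido']
```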

Looking at the XML that you are indexing, it doesn't contain XML encoded
accented characters.  It contains XML encoding of HTML encoding.  As a
specific example, your XML file contains this:

&amp;eacute;

The correct way to encode this would be the following:

&eacute;

There is a problem with this, however.  This is HTML encoding, not XML
encoding.  This fails when you try to index it in Solr:

Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general
entity "eacute"

If I put the accented character right in the XML without the XML or HTML
encoding, it works correctly.
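
The three cases can be reproduced outside Solr with a small Python sketch. It uses the standard library's XML parser rather than the Woodstox parser that Solr uses, so the error text differs, and the field_text helper and the sample word are hypothetical. The numeric character reference in the last case is an addition of this sketch: it is plain XML and needs no entity declaration.

```python
import html
import xml.etree.ElementTree as ET


def field_text(xml_snippet):
    """Return the element's text, or None if the XML does not parse."""
    try:
        return ET.fromstring(xml_snippet).text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    # HTML entity inside XML: rejected, because "eacute" is not declared.
    print(field_text("<field>caf&eacute;</field>"))      # None

    # Double-encoded: parses, but the value still holds the HTML entity.
    value = field_text("<field>caf&amp;eacute;</field>")
    print(value)                                         # caf&eacute;
    print(html.unescape(value))                          # café

    # Literal character or numeric reference: parses to the real character.
    print(field_text("<field>café</field>"))             # café
    print(field_text("<field>caf&#233;</field>"))        # café
```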

Thanks,

Shawn
