> I recently upgraded from Solr 1.3 to Solr 3.1 in order to take advantage of
> the HTMLStripCharFilter. But it isn't working as I expected.
>
> I have a text field that may contain HTML tags. I however would like to
> store it in Solr without the HTML tags. And retrieve the text field for
> display and for highlighting without HTML tags.
>
> I added <charFilter class="solr.HTMLStripCharFilterFactory"/> to the top of
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
> autoGeneratePhraseQueries="true"> in the schema.xml file of the solr
> example, both in <analyzer type="index"> and in <analyzer type="query">.
>
> And the text field is simply:
>
> <field name="text" type="text" indexed="true" stored="true"/>
>
> Now, when I do a search. The text field still has all the HTML tags in them
> and the highlighting is totally screwed up with em tags around virtually
> every word. What am I doing wrong?
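The char filter only affects the tokens that get indexed. The stored value, which is what search results and the highlighter return, is always the original input, HTML tags included. So you need to strip the HTML tags before the document reaches the analysis phase. If you are using DIH (DataImportHandler), you can add the HTMLStripTransformer to the entity and set stripHTML="true" on the field.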
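
For the DIH route, a data-config.xml along these lines might do it. This is only a sketch: the JDBC driver, URL, table name (docs) and column names (id, body) are made up for illustration, so substitute your own. The parts that matter are transformer="HTMLStripTransformer" on the entity and stripHTML="true" on the field.

```xml
<dataConfig>
  <!-- Hypothetical data source: driver and url are placeholders, not from the original post -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"/>
  <document>
    <!-- The transformer has to be declared on the entity for stripHTML to take effect -->
    <entity name="doc"
            query="SELECT id, body FROM docs"
            transformer="HTMLStripTransformer">
      <field column="id" name="id"/>
      <!-- "body" is an assumed source column; it is mapped to the "text" field from the question -->
      <field column="body" name="text" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

Since the tags are removed before the document is handed to Solr, the stored value of the text field should already be plain text, and the charFilter in schema.xml is then no longer needed for this field.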