Re: solr.HTMLStripCharFilterFactory issue

Erick Erickson Mon, 02 Sep 2019 06:52:57 -0700

This is expected behavior, assuming you’re asking for your stored field as part 
of the “fl” list.


The default behavior is to store the raw input and return it unaltered. The 
stored data is recorded before _any_ analysis, including charFilters; analysis 
only affects the indexed terms used for searching. If it worked the other way, 
you’d get back, say, the original text with all the accents removed (to use 
another CharFilter as an example), which would be surprising.

If you want the returned text to not include the markup, use an 
UpdateProcessorFactory in your update chain. These modify the input before the 
data is stored. For instance: 

https://lucene.apache.org/solr/7_6_0//solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html

It’s not obvious from the description, unless you follow the link to the 
superclass, that you can also restrict it to one or more fields; see:
https://lucene.apache.org/solr/7_6_0//solr-core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html
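
A minimal sketch of what such a chain might look like in solrconfig.xml. The 
chain name "strip-html" and the field name "body_html" are placeholders I made 
up for illustration, so substitute your own, and check the Javadoc above for 
the selectors your version supports:

```xml
<!-- Hypothetical chain name; pick whatever suits your setup. -->
<updateRequestProcessorChain name="strip-html">
  <!-- Strips HTML markup from the incoming value before it is stored. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <!-- "body_html" is an assumed field name, not one from this thread. -->
    <str name="fieldName">body_html</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory does the actual indexing and goes last. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

You would then select the chain with update.chain=strip-html on the update 
request, or mark it default="true". The superclass also accepts a typeName 
selector, so something like <str name="typeName">text_html</str> should apply 
it to every field of that type instead of naming fields one by one.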

Best,
Erick

> On Sep 2, 2019, at 9:30 AM, Big Gosh <bigg...@gmail.com> wrote:
> 
> Hi,
> 
> I've configured in solr 8.2.0 a field type as follows:
> 
> <fieldType name="text_html" class="solr.TextField"
> positionIncrementGap="100" multiValued="true">
>      <analyzer type="index">
>        <charFilter class="solr.HTMLStripCharFilterFactory"/>
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymGraphFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        <filter class="solr.FlattenGraphFilterFactory"/>
>        -->
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
>        <filter class="solr.SynonymGraphFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
> I expected the search to return the field with the HTML stripped, but the
> HTML tags are still in the field.
> 
> Is this correct, or did I make a mistake in the configuration?
> 
> I'm quite sure in the past I used this approach to strip html from the text
> 
> Thanks in advance
