Re: solr.HTMLStripCharFilterFactory issue

Big Gosh Mon, 02 Sep 2019 07:17:12 -0700

Thank you for your answer, very clear and precise.



On Mon, 2 Sep 2019 at 15:52, Erick Erickson <erickerick...@gmail.com> wrote:

> This is expected behavior, assuming you’re asking for your stored field as
> part of the “fl” list.
>
> The default behavior is to store the raw input and return it unaltered.
> The stored data is recorded before _any_ analysis, including charFilters.
> Otherwise it’d be surprising to see, say, the original text with all the
> accents removed (to use another CharFilter as an example).
>
> If you want the returned text to not include the markup, use an
> UpdateProcessorFactory in your update chain. These modify the input before
> the data is stored. For instance:
>
>
> https://lucene.apache.org/solr/7_6_0//solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
>
> It’s not obvious from the description, unless you follow the link to the
> superclass, that you can also restrict it to one or more fields; see:
>
> https://lucene.apache.org/solr/7_6_0//solr-core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html
>
> Best,
> Erick
>
> > On Sep 2, 2019, at 9:30 AM, Big Gosh <bigg...@gmail.com> wrote:
> >
> > Hi,
> >
> > I've configured in solr 8.2.0 a field type as follows:
> >
> > <fieldType name="text_html" class="solr.TextField"
> > positionIncrementGap="100" multiValued="true">
> >      <analyzer type="index">
> >        <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >        <tokenizer class="solr.StandardTokenizerFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt" />
> >        <!-- in this example, we will only use synonyms at query time
> >        <filter class="solr.SynonymGraphFilterFactory"
> > synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >        <filter class="solr.FlattenGraphFilterFactory"/>
> >        -->
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.StandardTokenizerFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt" />
> >        <filter class="solr.SynonymGraphFilterFactory"
> > synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >      </analyzer>
> >    </fieldType>
> >
> > I expected the search to return the field with the HTML stripped, but
> > the HTML tags are still in the field.
> >
> > Is this correct, or did I make a mistake in the configuration?
> >
> > I'm quite sure I used this approach in the past to strip HTML from the
> > text.
> >
> > Thanks in advance
>
>

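For readers landing on this thread, the update processor approach Erick describes might look roughly like the following in solrconfig.xml. This is only a sketch under assumptions: the chain name "strip-html" and the field name "content" are hypothetical placeholders that do not come from the thread. The "fieldName" selector is one of the field selectors inherited from FieldMutatingUpdateProcessorFactory.

```xml
<!-- Sketch only: the chain name and field name below are hypothetical. -->
<updateRequestProcessorChain name="strip-html">
  <!-- Strips HTML markup from the selected field(s) before the value is stored -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory is normally the last processor in the chain -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A chain defined this way only applies when it is selected, for example by passing update.chain=strip-html on the update request or by marking the chain as the default. The charFilter in the field type can stay as it is: it governs what gets indexed, while the processor governs what gets stored and returned.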