Re: HTMLStripCharFilterFactory not working in Solr4?

Mike Hugo Tue, 24 Jan 2012 11:28:45 -0800

Thanks for the response Yonik,
Interestingly enough, changing to the LegacyHTMLStripCharFilterFactory
does NOT solve the problem; in fact, I get the same result.


I can see that the LegacyHTMLStripCharFilterFactory is being applied at
startup:

Jan 24, 2012 1:25:29 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created : org.apache.solr.analysis.LegacyHTMLStripCharFilterFactory

However, I'm still getting the same assertion error.  Any thoughts?
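
To make the symptom concrete, here is a rough Python sketch (not Solr code) of
why 174 and 8482 can end up as terms. It assumes the numeric character
references reach the tokenizer undecoded, and it approximates the
tokenizer with a simple \w+ regex; the function name facet_terms is made up for
illustration.

```python
import html
import re


def facet_terms(text, decode_entities):
    """Return the sorted, lowercased word tokens found in text.

    decode_entities=True stands in for a char filter that turns
    '&#174;' into the actual symbol before tokenizing; False stands in
    for one that passes the raw reference through. The \\w+ regex is only
    an approximation of a real tokenizer.
    """
    if decode_entities:
        text = html.unescape(text)
    return sorted(token.lower() for token in re.findall(r"\w+", text))


label = "Bose&#174; &#8482;"
print(facet_terms(label, decode_entities=False))
print(facet_terms(label, decode_entities=True))
```

With decoding off, the digits inside the references survive as separate
tokens, which matches the terms in the failing assertion; with decoding on,
the symbols are dropped and only the word remains.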

Mike


On Tue, Jan 24, 2012 at 12:40 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:

> You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
> See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo <m...@piragua.com> wrote:
> > We recently updated to the latest build of Solr4 and everything is working
> > really well so far!  There is one case that is not working the same way it
> > was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> > registered, for example) in a field as defined below - it was working in
> > Solr3.4 with the configuration shown here, but is not working the same way
> > in Solr4.
> >
> > The label field is defined as type="text_general"
> > <field name="label" type="text_general" indexed="true" stored="false"
> > required="false" multiValued="true"/>
> >
> > Here's the type definition for text_general field:
> > <fieldType name="text_general" class="solr.TextField"
> > positionIncrementGap="100">
> >            <analyzer type="index">
> >                <tokenizer class="solr.StandardTokenizerFactory"/>
> >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >                <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt"
> >                        enablePositionIncrements="true"/>
> >                <filter class="solr.LowerCaseFilterFactory"/>
> >            </analyzer>
> >            <analyzer type="query">
> >                <tokenizer class="solr.StandardTokenizerFactory"/>
> >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >                <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt"
> >                        enablePositionIncrements="true"/>
> >                <filter class="solr.LowerCaseFilterFactory"/>
> >            </analyzer>
> >        </fieldType>
> >
> >
> > In Solr 3.4, that configuration was completely stripping html constructs
> > out of the indexed field which is exactly what we wanted.  If for example,
> > we then do a facet on the label field, like in the test below, we're
> > getting some terms in the response that we would not like to be there.
> >
> >
> > // test case (groovy)
> > void specialHtmlConstructsGetStripped() {
> >    SolrInputDocument inputDocument = new SolrInputDocument()
> >    inputDocument.addField('label', 'Bose&#174; &#8482;')
> >
> >    solrServer.add(inputDocument)
> >    solrServer.commit()
> >
> >    QueryResponse response = solrServer.query(new SolrQuery('bose'))
> >    assert 1 == response.results.numFound
> >
> >    SolrQuery facetQuery = new SolrQuery('bose')
> >    facetQuery.facet = true
> >    facetQuery.set(FacetParams.FACET_FIELD, 'label')
> >    facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> >
> >    response = solrServer.query(facetQuery)
> >    FacetField ff = response.facetFields.find {it.name == 'label'}
> >
> >    List suggestResponse = []
> >
> >    for (FacetField.Count facetField in ff?.values) {
> >        suggestResponse << facetField.name
> >    }
> >
> >    assert suggestResponse == ['bose']
> > }
> >
> > With the upgrade to Solr4, the assertion fails, the suggested response
> > contains 174 and 8482 as terms.  Test output is:
> >
> > Assertion failed:
> >
> > assert suggestResponse == ['bose']
> >       |               |
> >       |               false
> >       [174, 8482, bose]
> >
> >
> > I just tried again using the latest build from today, namely:
> > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> > getting the failing assertion. Is there a different way to configure the
> > HTMLStripCharFilterFactory in Solr4?
> >
> > Thanks in advance for any tips!
> >
> > Mike
>
