We recently updated to the latest build of Solr4 and everything is working
really well so far!  One case, however, behaves differently than it did in
Solr 3.4: we strip certain HTML constructs (the trademark and registered
entities, for example) from a field, as defined below. This worked in
Solr 3.4 with the configuration shown here, but does not work the same way
in Solr4.

The label field is defined with type="text_general":
<field name="label" type="text_general" indexed="true" stored="false" required="false" multiValued="true"/>

Here's the type definition for the text_general field:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>


In Solr 3.4, that configuration stripped HTML constructs out of the indexed
field completely, which is exactly what we wanted.  In Solr4, if we facet on
the label field, as in the test below, the response contains some terms
that we do not want there.


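// test case (groovy); solrServer is a SolrJ server instance set up elsewhere
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.response.FacetField
import org.apache.solr.client.solrj.response.QueryResponse
import org.apache.solr.common.SolrInputDocument
import org.apache.solr.common.params.FacetParams
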
void specialHtmlConstructsGetStripped() {
    SolrInputDocument inputDocument = new SolrInputDocument()
    inputDocument.addField('label', 'Bose&#174; &#8482;')

    solrServer.add(inputDocument)
    solrServer.commit()

    QueryResponse response = solrServer.query(new SolrQuery('bose'))
    assert 1 == response.results.numFound

    SolrQuery facetQuery = new SolrQuery('bose')
    facetQuery.facet = true
    facetQuery.set(FacetParams.FACET_FIELD, 'label')
    facetQuery.set(FacetParams.FACET_MINCOUNT, '1')

    response = solrServer.query(facetQuery)
    FacetField ff = response.facetFields.find {it.name == 'label'}

    List suggestResponse = []

    for (FacetField.Count facetField in ff?.values) {
        suggestResponse << facetField.name
    }

    assert suggestResponse == ['bose']
}

With the upgrade to Solr4, the assertion fails: the suggested response
contains 174 and 8482 as terms.  Test output is:

Assertion failed:

assert suggestResponse == ['bose']
       |               |
       |               false
       [174, 8482, bose]
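
To illustrate what the output above suggests, here is a minimal,
self-contained Java sketch. It does not use Solr or Lucene at all:
decodeNumericEntities and roughTokens are hypothetical helpers made up for
this illustration, and roughTokens only approximates a tokenizer by keeping
runs of letters and digits. Assuming the entities reach the tokenizer
undecoded, splitting 'Bose&#174; &#8482;' would leave 174 and 8482 behind as
terms, which matches the failing assertion; if the entities are decoded
first, only bose would remain. This is a guess at the mechanism, not a
confirmed diagnosis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntitySketch {
    // Matches decimal numeric character references such as &#174;
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");

    // Hypothetical helper (not a Solr API): replace each decimal reference
    // with the character it names, e.g. &#174; becomes the registered sign.
    static String decodeNumericEntities(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Hypothetical stand-in for a tokenizer plus lowercasing (an assumption,
    // not StandardTokenizer itself): keep only runs of letters and digits.
    static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String label = "Bose&#174; &#8482;";
        // Entities left undecoded: the digits survive as separate terms.
        System.out.println(roughTokens(label));                        // [bose, 174, 8482]
        // Entities decoded first: the symbols are dropped by the splitter.
        System.out.println(roughTokens(decodeNumericEntities(label))); // [bose]
    }
}
```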


I just tried again using the latest build from today, namely:
https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
getting the failing assertion. Is there a different way to configure the
HTMLStripCharFilterFactory in Solr4?

Thanks in advance for any tips!

Mike
