You can use LegacyHTMLStripCharFilterFactory to get the previous behavior. See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.
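For illustration, a minimal sketch of that swap, assuming the text_general type from the quoted message below and assuming the legacy factory resolves under the same "solr." shorthand as the other factories. Only the charFilter class differs from the original definition. The charFilter is listed ahead of the tokenizer here purely to mirror the order in which the analysis chain runs.

```xml
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- Assumption: class name taken from the reply above, written with the
         "solr." prefix used by the other factories in this schema. -->
    <charFilter class="solr.LegacyHTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- The type="query" analyzer would need the same charFilter change. -->
</fieldType>
```

After changing the schema, documents would presumably need to be reindexed before the facet terms reflect the new analysis chain.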
-Yonik
http://www.lucidimagination.com

On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo <m...@piragua.com> wrote:
> We recently updated to the latest build of Solr4 and everything is working
> really well so far! There is one case that is not working the same way it
> was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> registered, for example) in a field as defined below - it was working in
> Solr3.4 with the configuration shown here, but is not working the same way
> in Solr4.
>
> The label field is defined as type="text_general"
> <field name="label" type="text_general" indexed="true" stored="false"
>        required="false" multiValued="true"/>
>
> Here's the type definition for text_general field:
> <fieldType name="text_general" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> In Solr 3.4, that configuration was completely stripping html constructs
> out of the indexed field which is exactly what we wanted. If for example,
> we then do a facet on the label field, like in the test below, we're
> getting some terms in the response that we would not like to be there.
>
>
> // test case (groovy)
> void specialHtmlConstructsGetStripped() {
>     SolrInputDocument inputDocument = new SolrInputDocument()
>     inputDocument.addField('label', 'Bose® ™')
>
>     solrServer.add(inputDocument)
>     solrServer.commit()
>
>     QueryResponse response = solrServer.query(new SolrQuery('bose'))
>     assert 1 == response.results.numFound
>
>     SolrQuery facetQuery = new SolrQuery('bose')
>     facetQuery.facet = true
>     facetQuery.set(FacetParams.FACET_FIELD, 'label')
>     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
>
>     response = solrServer.query(facetQuery)
>     FacetField ff = response.facetFields.find {it.name == 'label'}
>
>     List suggestResponse = []
>
>     for (FacetField.Count facetField in ff?.values) {
>         suggestResponse << facetField.name
>     }
>
>     assert suggestResponse == ['bose']
> }
>
> With the upgrade to Solr4, the assertion fails, the suggested response
> contains 174 and 8482 as terms. Test output is:
>
> Assertion failed:
>
> assert suggestResponse == ['bose']
>        |               |
>        |               false
>        [174, 8482, bose]
>
>
> I just tried again using the latest build from today, namely:
> https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> getting the failing assertion. Is there a different way to configure the
> HTMLStripCharFilterFactory in Solr4?
>
> Thanks in advance for any tips!
>
> Mike