We recently updated to the latest build of Solr 4 and everything is working really well so far! There is one case that does not behave the way it did in Solr 3.4: we strip certain HTML constructs (trademark and registered, for example) out of a field defined as shown below. This worked in Solr 3.4 with the configuration shown here, but it does not work the same way in Solr 4.
The label field is defined as type="text_general":

    <field name="label" type="text_general" indexed="true" stored="false" required="false" multiValued="true"/>

Here's the type definition for the text_general field:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

In Solr 3.4, that configuration completely stripped HTML constructs out of the indexed field, which is exactly what we wanted. Now, if we facet on the label field, as in the test below, we get some terms in the response that we do not want there.
    // test case (groovy)
    void specialHtmlConstructsGetStripped() {
        // Index one document whose label carries the registered and trademark entities
        SolrInputDocument inputDocument = new SolrInputDocument()
        inputDocument.addField('label', 'Bose&#174; &#8482;')

        solrServer.add(inputDocument)
        solrServer.commit()

        // The document should be found by its plain-text term
        QueryResponse response = solrServer.query(new SolrQuery('bose'))
        assert 1 == response.results.numFound

        // Facet on the label field and collect the indexed terms
        SolrQuery facetQuery = new SolrQuery('bose')
        facetQuery.facet = true
        facetQuery.set(FacetParams.FACET_FIELD, 'label')
        facetQuery.set(FacetParams.FACET_MINCOUNT, '1')

        response = solrServer.query(facetQuery)
        FacetField ff = response.facetFields.find { it.name == 'label' }

        List suggestResponse = []
        for (FacetField.Count facetField in ff?.values) {
            suggestResponse << facetField.name
        }

        // Only the plain-text term should remain after HTML stripping
        assert suggestResponse == ['bose']
    }

With the upgrade to Solr 4, the assertion fails: the suggested response contains 174 and 8482 as terms. The test output is:

    Assertion failed:

    assert suggestResponse == ['bose']
           |               |
           |               false
           [174, 8482, bose]

I just tried again using the latest build from today, namely https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/, and we're still getting the failing assertion.

Is there a different way to configure the HTMLStripCharFilterFactory in Solr 4?

Thanks in advance for any tips!

Mike
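
P.S. To illustrate where the stray terms might come from, here is a minimal, self-contained Java sketch. It does not use Lucene or Solr at all: the class EntityTokenDemo, its two methods, and the regex-based "tokenizer" are hypothetical stand-ins written only for this illustration, and it assumes the label reaches the analyzer as the numeric character references &#174; and &#8482;. It shows that if the references are not decoded before tokenization, a simple word tokenizer keeps the digits as terms, whereas decoding them first leaves only the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT how Lucene/Solr implements analysis.
public class EntityTokenDemo {

    // Matches decimal numeric character references such as &#174; or &#8482;.
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");

    // Crude stand-in for a word tokenizer: runs of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Replaces each decimal character reference with the character it names. */
    public static String decodeNumericEntities(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Splits text into lowercased letter/digit runs, dropping everything else. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String label = "Bose&#174; &#8482;";

        // References left undecoded: the digits survive as separate terms.
        System.out.println(tokenize(label));                        // [bose, 174, 8482]

        // References decoded first: the symbols are not letters or digits, so they drop out.
        System.out.println(tokenize(decodeNumericEntities(label))); // [bose]
    }
}
```

The first result has the same terms as the failing facet response, which would be consistent with the character references not being decoded before the tokenizer sees them in our Solr 4 setup, though we have not confirmed that this is what actually happens.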