Thanks for the response, Yonik.

Interestingly enough, changing to the LegacyHTMLStripCharFilterFactory does NOT solve the problem; in fact, I get the same result.
I can see that the LegacyHTMLStripCharFilterFactory is being applied at startup:

Jan 24, 2012 1:25:29 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created : org.apache.solr.analysis.LegacyHTMLStripCharFilterFactory

However, I'm still getting the same assertion error. Any thoughts?

Mike

On Tue, Jan 24, 2012 at 12:40 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
> See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo <m...@piragua.com> wrote:
> > We recently updated to the latest build of Solr4 and everything is working
> > really well so far! There is one case that is not working the same way it
> > was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> > registered, for example) in a field as defined below - it was working in
> > Solr3.4 with the configuration shown here, but is not working the same way
> > in Solr4.
> >
> > The label field is defined as type="text_general"
> > <field name="label" type="text_general" indexed="true" stored="false"
> >        required="false" multiValued="true"/>
> >
> > Here's the type definition for text_general field:
> > <fieldType name="text_general" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt"
> >             enablePositionIncrements="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt"
> >             enablePositionIncrements="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > In Solr 3.4, that configuration was completely stripping html constructs
> > out of the indexed field which is exactly what we wanted. If for example,
> > we then do a facet on the label field, like in the test below, we're
> > getting some terms in the response that we would not like to be there.
> >
> > // test case (groovy)
> > void specialHtmlConstructsGetStripped() {
> >     SolrInputDocument inputDocument = new SolrInputDocument()
> >     inputDocument.addField('label', 'Bose® ™')
> >
> >     solrServer.add(inputDocument)
> >     solrServer.commit()
> >
> >     QueryResponse response = solrServer.query(new SolrQuery('bose'))
> >     assert 1 == response.results.numFound
> >
> >     SolrQuery facetQuery = new SolrQuery('bose')
> >     facetQuery.facet = true
> >     facetQuery.set(FacetParams.FACET_FIELD, 'label')
> >     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> >
> >     response = solrServer.query(facetQuery)
> >     FacetField ff = response.facetFields.find {it.name == 'label'}
> >
> >     List suggestResponse = []
> >
> >     for (FacetField.Count facetField in ff?.values) {
> >         suggestResponse << facetField.name
> >     }
> >
> >     assert suggestResponse == ['bose']
> > }
> >
> > With the upgrade to Solr4, the assertion fails, the suggested response
> > contains 174 and 8482 as terms. Test output is:
> >
> > Assertion failed:
> >
> > assert suggestResponse == ['bose']
> >        |               |
> >        |               false
> >        [174, 8482, bose]
> >
> > I just tried again using the latest build from today, namely:
> > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> > getting the failing assertion. Is there a different way to configure the
> > HTMLStripCharFilterFactory in Solr4?
> >
> > Thanks in advance for any tips!
> >
> > Mike
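
A note on the two stray terms: 174 and 8482 are the decimal code points of ® (U+00AE) and ™ (U+2122). That makes them look like what a tokenizer would emit if the label reached it as the numeric character references &#174; and &#8482; without being decoded first. This is only a guess at the mechanism, not a confirmed diagnosis. The plain-Java sketch below illustrates the difference without using any Lucene or Solr classes. The names EntityTokenSketch, tokenize, and decodeNumericEntities are hypothetical, and the tokenizer is a crude stand-in for StandardTokenizer followed by lowercasing, not the real implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: none of this is Solr or Lucene API.
public class EntityTokenSketch {

    // Crude stand-in for a tokenizer plus lowercasing: keeps runs of
    // letters/digits, treats everything else as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Replaces decimal numeric character references such as &#174;
    // with the character they denote.
    static String decodeNumericEntities(String text) {
        Matcher m = Pattern.compile("&#(\\d+);").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed input: the label written with numeric character references.
        String raw = "Bose&#174; &#8482;";

        // Tokenizing the undecoded text leaves the digits behind as terms.
        System.out.println(tokenize(raw));                        // [bose, 174, 8482]

        // Decoding first turns them into symbols, which the tokenizer drops.
        System.out.println(tokenize(decodeNumericEntities(raw))); // [bose]
    }
}
```

If that is what is happening, the question becomes why the character filter's decoding is not taking effect before tokenization in the Solr4 build, which the sketch cannot answer.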