Hi Mike,

When I add the following test to TestHTMLStripCharFilterFactory.java on Solr trunk, it passes:

  public void testNumericCharacterEntities() throws Exception {
    final String text = "Bose&#174; &#8482;";  // |Bose® ™|
    HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
    htmlStripFactory.init(Collections.<String,String>emptyMap());
    CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
    StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
    stdTokFactory.init(DEFAULT_VERSION_PARAM);
    Tokenizer stream = stdTokFactory.create(charStream);
    assertTokenStreamContents(stream, new String[] { "Bose" });
  }

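The test expects only "Bose" because the two decoded characters are symbols, not letters or digits. As a quick check outside Solr, here is a minimal sketch that uses only the JDK. It is not Lucene's HTMLStripCharFilter: the class and method names (SymbolCategoryCheck, decodeNumericEntity, isOtherSymbol) are made up for illustration, and the decoder assumes a well-formed decimal entity of the form &#NNN;.

```java
final class SymbolCategoryCheck {

    // Hypothetical helper: decodes a decimal numeric character entity
    // such as "&#174;" into the character it names. Assumes well-formed input.
    public static String decodeNumericEntity(String entity) {
        // Strip the leading "&#" and the trailing ";" to get the digits.
        int codePoint = Integer.parseInt(entity.substring(2, entity.length() - 1));
        return new String(Character.toChars(codePoint));
    }

    // True if the first code point of s is in Unicode general category
    // "Symbol, Other" (So), as reported by the JDK.
    public static boolean isOtherSymbol(String s) {
        return Character.getType(s.codePointAt(0)) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        for (String entity : new String[] { "&#174;", "&#8482;" }) {
            String decoded = decodeNumericEntity(entity);
            // Print the code point in hex to avoid console encoding issues.
            System.out.println(entity + " -> U+"
                + Integer.toHexString(decoded.codePointAt(0)).toUpperCase()
                + ", Symbol/Other: " + isOtherSymbol(decoded));
        }
    }
}
```

Both entities should report "Symbol, Other", while a plain letter such as "B" should not.
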
What's happening: First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™". Then stdTokFactory declines to tokenize "®" and "™", because they belong to the Unicode general category "Symbol, Other", and so they are not included in any of the output tokens.

StandardTokenizer uses the Word Break rules from UAX#29 <http://unicode.org/reports/tr29/> to find token boundaries, and then outputs only alphanumeric tokens. See the JFlex grammar for details: <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.

The behavior you're seeing is not consistent with the above test.

Steve

> -----Original Message-----
> From: Mike Hugo [mailto:m...@piragua.com]
> Sent: Tuesday, January 24, 2012 1:34 PM
> To: solr-user@lucene.apache.org
> Subject: HTMLStripCharFilterFactory not working in Solr4?
>
> We recently updated to the latest build of Solr4 and everything is working
> really well so far! There is one case that is not working the same way it
> was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> registered, for example) in a field as defined below - it was working in
> Solr3.4 with the configuration shown here, but is not working the same way
> in Solr4.
>
> The label field is defined as type="text_general"
> <field name="label" type="text_general" indexed="true" stored="false"
>        required="false" multiValued="true"/>
>
> Here's the type definition for text_general field:
> <fieldType name="text_general" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> In Solr 3.4, that configuration was completely stripping html constructs
> out of the indexed field which is exactly what we wanted. If for example,
> we then do a facet on the label field, like in the test below, we're
> getting some terms in the response that we would not like to be there.
>
>
> // test case (groovy)
> void specialHtmlConstructsGetStripped() {
>     SolrInputDocument inputDocument = new SolrInputDocument()
>     inputDocument.addField('label', 'Bose&#174; &#8482;')
>
>     solrServer.add(inputDocument)
>     solrServer.commit()
>
>     QueryResponse response = solrServer.query(new SolrQuery('bose'))
>     assert 1 == response.results.numFound
>
>     SolrQuery facetQuery = new SolrQuery('bose')
>     facetQuery.facet = true
>     facetQuery.set(FacetParams.FACET_FIELD, 'label')
>     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
>
>     response = solrServer.query(facetQuery)
>     FacetField ff = response.facetFields.find {it.name == 'label'}
>
>     List suggestResponse = []
>
>     for (FacetField.Count facetField in ff?.values) {
>         suggestResponse << facetField.name
>     }
>
>     assert suggestResponse == ['bose']
> }
>
> With the upgrade to Solr4, the assertion fails, the suggested response
> contains 174 and 8482 as terms. Test output is:
>
> Assertion failed:
>
> assert suggestResponse == ['bose']
>        |               |
>        |               false
>        [174, 8482, bose]
>
>
> I just tried again using the latest build from today, namely:
> https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> getting the failing assertion. Is there a different way to configure the
> HTMLStripCharFilterFactory in Solr4?
>
> Thanks in advance for any tips!
>
> Mike