Thanks, guys!  I'll grab the latest build from the Solr4 Jenkins server when
those commits get picked up and try it out.  Thanks for the quick
turnaround!

Mike

On Wed, Jan 25, 2012 at 11:01 AM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Mike,
>
> Yonik committed a fix to Solr trunk - your test on LUCENE-3721 succeeds
> for me now.  (On Solr trunk, *all* CharFilters have been non-functional
> since LUCENE-3396 was committed in r1175297 on 25 Sept 2011, until Yonik's
> fix today in r1235810; Solr 3.x was not affected - CharFilters have been
> working there all along.)
>
> Steve
>
> > -----Original Message-----
> > From: Mike Hugo [mailto:m...@piragua.com]
> > Sent: Tuesday, January 24, 2012 3:56 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: HTMLStripCharFilterFactory not working in Solr4?
> >
> > Thanks for the responses everyone.
> >
> > Steve, the test method you provided also works for me.  However, when I
> > try a more end-to-end test with the HTMLStripCharFilterFactory configured
> > for a field, I am still having the same problem.  I attached a failing
> > unit test and configuration to the following issue in JIRA:
> >
> > https://issues.apache.org/jira/browse/LUCENE-3721
> >
> > I appreciate all the prompt responses!  Looking forward to finding the
> > root cause of this guy :)  If there's something I'm doing incorrectly in
> > the configuration, please let me know!
> >
> > Mike
> >
> > On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe <sar...@syr.edu> wrote:
> >
> > > Hi Mike,
> > >
> > > When I add the following test to TestHTMLStripCharFilterFactory.java on
> > > Solr trunk, it passes:
> > >
> > > public void testNumericCharacterEntities() throws Exception {
> > >   final String text = "Bose&#174; &#8482;";  // |Bose® ™|
> > >   HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
> > >   htmlStripFactory.init(Collections.<String,String>emptyMap());
> > >   CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
> > >   StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
> > >   stdTokFactory.init(DEFAULT_VERSION_PARAM);
> > >   Tokenizer stream = stdTokFactory.create(charStream);
> > >   assertTokenStreamContents(stream, new String[] { "Bose" });
> > > }
> > >
> > > What's happening:
> > >
> > > First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".
> > > Then stdTokFactory declines to tokenize "®" and "™", because they
> > > belong to the Unicode general category "Symbol, Other", and so they
> > > are not included in any of the output tokens.
> > >
> > > StandardTokenizer uses the Word Break rules from UAX#29
> > > <http://unicode.org/reports/tr29/> to find token boundaries, and then
> > > outputs only alphanumeric tokens.  See the JFlex grammar for details:
> > > <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
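> > >
> > > As a quick illustration (a hypothetical sketch of my own here, not part
> > > of the test above), plain java.lang.Character confirms that both
> > > characters fall in that category:
> > >
> > > // Hypothetical sketch: the registered sign (U+00AE) and the trademark
> > > // sign (U+2122) are both in Unicode general category "Symbol, Other",
> > > // which the StandardTokenizer grammar does not emit as tokens.
> > > public class SymbolCategoryCheck {
> > >   public static void main(String[] args) {
> > >     System.out.println(Character.getType('\u00AE') == Character.OTHER_SYMBOL); // true
> > >     System.out.println(Character.getType('\u2122') == Character.OTHER_SYMBOL); // true
> > >   }
> > > }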
> > >
> > > The behavior you're seeing is not consistent with the above test.
> > >
> > > Steve
> > >
> > > > -----Original Message-----
> > > > From: Mike Hugo [mailto:m...@piragua.com]
> > > > Sent: Tuesday, January 24, 2012 1:34 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: HTMLStripCharFilterFactory not working in Solr4?
> > > >
> > > > We recently updated to the latest build of Solr4 and everything is
> > > > working really well so far!  There is one case that is not working the
> > > > same way it was in Solr 3.4: we strip out certain HTML constructs (like
> > > > trademark and registered, for example) in a field as defined below.  It
> > > > was working in Solr 3.4 with the configuration shown here, but is not
> > > > working the same way in Solr4.
> > > >
> > > > The label field is defined as type="text_general"
> > > > <field name="label" type="text_general" indexed="true" stored="false"
> > > > required="false" multiValued="true"/>
> > > >
> > > > Here's the type definition for the text_general field:
> > > > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> > > >     <analyzer type="index">
> > > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >         <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >                 words="stopwords.txt" enablePositionIncrements="true"/>
> > > >         <filter class="solr.LowerCaseFilterFactory"/>
> > > >     </analyzer>
> > > >     <analyzer type="query">
> > > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >         <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >                 words="stopwords.txt" enablePositionIncrements="true"/>
> > > >         <filter class="solr.LowerCaseFilterFactory"/>
> > > >     </analyzer>
> > > > </fieldType>
> > > >
> > > >
> > > > In Solr 3.4, that configuration was completely stripping HTML
> > > > constructs out of the indexed field, which is exactly what we wanted.
> > > > If, for example, we then facet on the label field, like in the test
> > > > below, we get some terms in the response that we would not like to be
> > > > there.
> > > >
> > > >
> > > > // test case (groovy)
> > > > void specialHtmlConstructsGetStripped() {
> > > >     SolrInputDocument inputDocument = new SolrInputDocument()
> > > >     inputDocument.addField('label', 'Bose&#174; &#8482;')
> > > >
> > > >     solrServer.add(inputDocument)
> > > >     solrServer.commit()
> > > >
> > > >     QueryResponse response = solrServer.query(new SolrQuery('bose'))
> > > >     assert 1 == response.results.numFound
> > > >
> > > >     SolrQuery facetQuery = new SolrQuery('bose')
> > > >     facetQuery.facet = true
> > > >     facetQuery.set(FacetParams.FACET_FIELD, 'label')
> > > >     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> > > >
> > > >     response = solrServer.query(facetQuery)
> > > >     FacetField ff = response.facetFields.find {it.name == 'label'}
> > > >
> > > >     List suggestResponse = []
> > > >
> > > >     for (FacetField.Count facetField in ff?.values) {
> > > >         suggestResponse << facetField.name
> > > >     }
> > > >
> > > >     assert suggestResponse == ['bose']
> > > > }
> > > >
> > > > With the upgrade to Solr4, the assertion fails: the suggested response
> > > > contains 174 and 8482 as terms.  Test output is:
> > > >
> > > > Assertion failed:
> > > >
> > > > assert suggestResponse == ['bose']
> > > >        |               |
> > > >        |               false
> > > >        [174, 8482, bose]
> > > >
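> > > > As a rough illustration of what those extra terms suggest (a
> > > > hypothetical sketch on my part, assuming a 4.x-era Lucene API, not
> > > > code from our project): if the char filter never runs, the
> > > > StandardTokenizer sees the raw entity text and emits the digit runs
> > > > as tokens, matching the facet values above.
> > > >
> > > > import java.io.StringReader;
> > > > import org.apache.lucene.analysis.standard.StandardTokenizer;
> > > > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > > > import org.apache.lucene.util.Version;
> > > >
> > > > public class RawEntityTokens {
> > > >     public static void main(String[] args) throws Exception {
> > > >         // Tokenize the raw, unstripped input exactly as it would reach
> > > >         // the tokenizer if the charFilter were skipped.
> > > >         StandardTokenizer tok = new StandardTokenizer(Version.LUCENE_40,
> > > >                 new StringReader("Bose&#174; &#8482;"));
> > > >         CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
> > > >         tok.reset();
> > > >         while (tok.incrementToken()) {
> > > >             System.out.println(term.toString()); // Bose, 174, 8482
> > > >         }
> > > >         tok.end();
> > > >         tok.close();
> > > >     }
> > > > }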
> > > >
> > > > I just tried again using the latest build from today, namely
> > > > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/, and we're
> > > > still getting the failing assertion.  Is there a different way to
> > > > configure the HTMLStripCharFilterFactory in Solr4?
> > > >
> > > > Thanks in advance for any tips!
> > > >
> > > > Mike
> > >
>
