charFilters are executed first regardless of their position in the analyzer.
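
For readability it still helps to declare it where it runs. A minimal sketch of the index-time analyzer with the char filter listed ahead of the tokenizer, reusing only the factories and attributes from the fieldType quoted in this thread; I have not tested it against this setup, so treat it as an illustration of the ordering rather than a confirmed fix:

```xml
<analyzer type="index">
  <!-- char filters operate on the raw character stream, before tokenizing -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- token filters run after the tokenizer, in the order listed -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

This keeps one index analyzer and one query analyzer per fieldType, with no third untyped analyzer element.
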
On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> I think you need to list the charfilter earlier in the analysis chain;
> before the tokenizer. Probably Solr should tell you this...
>
> -Mike
>
> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
> > sounds logical. I just changed it to the following, restarted and
> > reindexed with commit:
> >
> > <fieldType name="text" class="solr.TextField"
> >     positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >   <analyzer type="index">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >         generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"/>
> >     <filter class="solr.PorterStemFilterFactory"/>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >         generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >         catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"/>
> >     <filter class="solr.PorterStemFilterFactory"/>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > Unfortunately that did not fix the error. There are still <h3> tags
> > inside the data. Although I believe there are fewer than before, I
> > cannot prove that. Fact is, there are still html tags inside the data.
> >
> > Any other ideas what the problem could be?
> >
> > 2011/7/25 Markus Jelsma <markus.jel...@openindex.io>
> >
> >> You've three analyzer elements, I wonder what that would do. You need
> >> to add the char filter to the index-time analyzer.
> >>
> >> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> >>> Hi there,
> >>>
> >>> I am trying to strip html tags from the data before adding the
> >>> documents to the index. To do that I altered schema.xml like this:
> >>>
> >>> <fieldType name="text" class="solr.TextField"
> >>>     positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>         generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>         generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>         catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.KeywordMarkerFilterFactory"/>
> >>>     <filter class="solr.PorterStemFilterFactory"/>
> >>>   </analyzer>
> >>>   <analyzer>
> >>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>> <fields>
> >>>   <field name="text" type="text" indexed="true" stored="true"
> >>>       required="false"/>
> >>> </fields>
> >>>
> >>> Unfortunately this does not work, the html tags like <h3> are still
> >>> present after restarting and reindexing. I also tried
> >>> htmlstriptransformer, but this did not work either.
> >>>
> >>> Has anybody an idea how to get this done? Thank you in advance for any
> >>> hint.
> >>>
> >>> Merlin
> >>
> >> --
> >> Markus Jelsma - CTO - Openindex
> >> http://www.linkedin.com/in/markus17
> >> 050-8536620 / 06-50258350

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350