You've three analyzer elements, i wonder what that would do. You need to add the char filter to the index-time analyzer.
On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote: > Hi there, > > I am trying to strip html tags from the data before adding the documents to > the index. To do that I altered schem.xml like this: > > <fieldType name="text" class="solr.TextField" > positionIncrementGap="100" autoGeneratePhraseQueries="true"> > <analyzer type="index"> > <tokenizer class="solr.WhitespaceTokenizerFactory"/> > <filter class="solr.WordDelimiterFilterFactory" > generateWordParts="1" generateNumberParts="1" catenateWords="1" > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> > <filter class="solr.LowerCaseFilterFactory"/> > <filter class="solr.KeywordMarkerFilterFactory"/> > <filter class="solr.PorterStemFilterFactory"/> > </analyzer> > <analyzer type="query"> > <tokenizer class="solr.WhitespaceTokenizerFactory"/> > <filter class="solr.WordDelimiterFilterFactory" > generateWordParts="1" generateNumberParts="1" catenateWords="0" > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/> > <filter class="solr.LowerCaseFilterFactory"/> > <filter class="solr.KeywordMarkerFilterFactory"/> > <filter class="solr.PorterStemFilterFactory"/> > </analyzer> > <analyzer> > <charFilter class="solr.HTMLStripCharFilterFactory"/> > <tokenizer class="solr.WhitespaceTokenizerFactory"/> > </analyzer> > </fieldType> > > <fields> > <field name="text" type="text" indexed="true" stored="true" > required="false"/> > </fields> > > Unfortunatelly this does not work, the hmtl tags like <h3> are still > present after restarting and reindexing. I also tryed > htmlstriptransformer, but this did not work either. > > Has anybody an idea how to get this done? Thank you in advance for any > hint. > > Merlin -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350