Sounds logical. I just changed it to the following, restarted, and reindexed with a commit:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
  </analyzer>
</fieldType>

Unfortunately, that did not fix the error. There are still <h3> tags inside the data. I believe there are fewer than before, but I cannot prove that. The fact is, there are still HTML tags inside the data.

Any other ideas what the problem could be?

2011/7/25 Markus Jelsma <markus.jel...@openindex.io>

> You have three analyzer elements; I wonder what that would do. You need to
> add the char filter to the index-time analyzer.
>
> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> > Hi there,
> >
> > I am trying to strip HTML tags from the data before adding the documents
> > to the index.
> > To do that I altered schema.xml like this:
> >
> > <fieldType name="text" class="solr.TextField"
> >     positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >   <analyzer type="index">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >         generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"/>
> >     <filter class="solr.PorterStemFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >         generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >         catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"/>
> >     <filter class="solr.PorterStemFilterFactory"/>
> >   </analyzer>
> >   <analyzer>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > <fields>
> >   <field name="text" type="text" indexed="true" stored="true"
> >       required="false"/>
> > </fields>
> >
> > Unfortunately this does not work; the HTML tags like <h3> are still
> > present after restarting and reindexing. I also tried
> > HTMLStripTransformer, but this did not work either.
> >
> > Does anybody have an idea how to get this done? Thank you in advance
> > for any hint.
> >
> > Merlin
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
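
For reference, a minimal sketch of what Markus's suggestion might look like. The char filter sits inside the index-time analyzer and is listed first, because char filters work on the raw character stream before the tokenizer runs. The class names and attribute values are copied from the thread; the ordering is an assumption about the intended layout, not a confirmed fix. Analysis only affects the indexed terms. With stored="true", the stored value returned in search results is the original input, so tags may still be visible there even when stripping works at index time.

```xml
<!-- Sketch only: index-time analyzer with the HTML-stripping char filter
     listed ahead of the tokenizer. The query analyzer is omitted here. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <!-- Char filters run on the raw text before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

One way to check the effect on the indexed terms, as opposed to the stored value, is the analysis page in the Solr admin interface, which shows the output of each analysis stage for a sample input.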