Re: strip html from data

Mike Sokolov Mon, 25 Jul 2011 08:54:38 -0700

I think you need to list the charFilter earlier in the analysis chain, before the tokenizer. Solr should probably tell you this...
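
As a rough sketch of that reordering (assuming the rest of the field type stays as posted in the quoted message below), the index-time analyzer would declare the charFilter first, ahead of the tokenizer:

```xml
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <!-- Char filters work on the raw character stream, so list them before the tokenizer -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- The query-time analyzer can stay as originally posted -->
</fieldType>
```

Keep in mind that this only changes the terms that get indexed. A field declared with stored="true" still returns the text exactly as it was submitted, so tags can remain visible in search results even when the analysis chain strips them.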


-Mike


On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:

Sounds logical. I just changed it to the following, restarted, and reindexed
with a commit:

    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="1"
                    catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="0"
                    catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
        </analyzer>
    </fieldType>

Unfortunately, that did not fix the problem. There are still <h3> tags inside
the data. I believe there are fewer than before, but I cannot prove that.
The fact is, there are still HTML tags inside the data.

Any other ideas what the problem could be?





2011/7/25 Markus Jelsma <markus.jel...@openindex.io>

You have three analyzer elements; I wonder what that would do. You need to
add the char filter to the index-time analyzer.

On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:

Hi there,

I am trying to strip HTML tags from the data before adding the documents to
the index. To do that I altered schema.xml like this:

    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="1"
                    catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="0"
                    catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        </analyzer>
    </fieldType>

    <fields>
        <field name="text" type="text" indexed="true" stored="true"
               required="false"/>
    </fields>

Unfortunately, this does not work; HTML tags like <h3> are still
present after restarting and reindexing. I also tried
HTMLStripTransformer, but this did not work either.
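
For reference, the DataImportHandler's HTMLStripTransformer is enabled per entity and per field in data-config.xml. The sketch below is only illustrative: the data source settings, entity name, query, and column names are all hypothetical, since the data-config actually in use was not posted.

```xml
<!-- data-config.xml: every name, URL, and query here is a placeholder -->
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="example" password="example"/>
  <document>
    <!-- The transformer must be named on the entity... -->
    <entity name="article" query="SELECT id, body FROM articles"
            transformer="HTMLStripTransformer">
      <field column="id" name="id"/>
      <!-- ...and switched on for each column with stripHTML="true" -->
      <field column="body" name="text" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

Unlike a char filter, a transformer rewrites the value before it reaches the field, so it should affect the stored text as well as the indexed terms. It only applies to documents imported through the DataImportHandler.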

Does anybody have an idea how to get this done? Thank you in advance for any
hint.

Merlin

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
