Re: strip html from data

Mike Sokolov Mon, 25 Jul 2011 09:08:10 -0700

Hmm - I'm not sure about that; see https://issues.apache.org/jira/browse/SOLR-2119


On 07/25/2011 12:01 PM, Markus Jelsma wrote:

charFilters are executed first regardless of their position in the analyzer.


On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:

I think you need to list the charFilter earlier in the analysis chain,
before the tokenizer.  Probably Solr should tell you this...

-Mike

On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:

Sounds logical. I just changed it to the following, restarted, and
reindexed with commit:
    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="1"
                    catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="0"
                    catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
        </analyzer>
    </fieldType>

Unfortunately that did not fix the error. There are still <h3> tags
inside the data, although I believe there are fewer than before, but I
cannot prove that. The fact is, there are still HTML tags inside the data.

Any other ideas what the problem could be?





2011/7/25 Markus Jelsma <markus.jel...@openindex.io>

You have three analyzer elements; I wonder what that would do. You need to
add the char filter to the index-time analyzer.
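
To illustrate the advice above, here is a minimal sketch of a field type with
only two analyzers, where the char filter sits inside the index-time analyzer.
It is a trimmed-down variant of the "text" type from this thread, not a tested
configuration: the shortened filter list is an assumption made for brevity,
and the query-time analyzer is left without the char filter on the assumption
that queries contain no markup.

```xml
<!-- Sketch only: trimmed-down version of the "text" field type in this thread. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <!-- Index-time analyzer: char filter, then tokenizer, then token filters. -->
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <!-- Query-time analyzer: no char filter, assuming queries carry no HTML. -->
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

Note that, as far as I know, analysis only changes the terms that are indexed;
a field declared with stored="true" still returns the original text, markup
included, so the indexed terms are better checked on the admin analysis page
than in the returned documents.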

On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:

Hi there,

I am trying to strip HTML tags from the data before adding the
documents to the index. To do that I altered schema.xml like this:
    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="1"
                    catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="0"
                    catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        </analyzer>
    </fieldType>

    <fields>
        <field name="text" type="text" indexed="true" stored="true"
               required="false"/>
    </fields>

Unfortunately this does not work; HTML tags like <h3> are still
present after restarting and reindexing. I also tried
HTMLStripTransformer, but this did not work either.

Does anybody have an idea how to get this done? Thank you in advance for any
hint.

Merlin

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
