Re: strip html from data

Mike Sokolov Mon, 25 Jul 2011 09:49:40 -0700

Hmm, that looks like it's working fine.  I stand corrected.



On 07/25/2011 12:24 PM, Markus Jelsma wrote:

I've seen that issue too and read comments on the list, yet I've never had
trouble with the order; I don't know what's going on. Check this analyzer, where
I've moved the charFilter to the bottom:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="false" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="false"
          words="stopwords.txt"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"
          language="Dutch"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>

The analysis chain still does its job as I expect for the input:
<span>bla bla</span>

Index Analyzer
org.apache.solr.analysis.HTMLStripCharFilterFactory
{luceneMatchVersion=LUCENE_34}
text    bla bla
org.apache.solr.analysis.WhitespaceTokenizerFactory
{luceneMatchVersion=LUCENE_34}
position        1       2
term text       bla     bla
startOffset     6       10
endOffset       9       13
org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34,
generateWordParts=1, catenateAll=0, catenateNumbers=1}
position        1       2
term text       bla     bla
startOffset     6       10
endOffset       9       13
type    word    word
org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_34}
position        1       2
term text       bla     bla
startOffset     6       10
endOffset       9       13
type    word    word
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
position        1       2
term text       bla     bla
type    word    word
startOffset     6       10
endOffset       9       13
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=false, luceneMatchVersion=LUCENE_34}
position        1       2
term text       bla     bla
type    word    word
startOffset     6       10
endOffset       9       13
org.apache.solr.analysis.ASCIIFoldingFilterFactory
{luceneMatchVersion=LUCENE_34}
position        1       2
term text       bla     bla
type    word    word
startOffset     6       10
endOffset       9       13
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt,
language=Dutch, luceneMatchVersion=LUCENE_34}
position        1       2
term text       bla     bla
keyword         false   false
type    word    word
startOffset     6       10
endOffset       9       13
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
{luceneMatchVersion=LUCENE_34}
position        1       2
term text       bla     bla
keyword         false   false
type    word    word
startOffset     6       10
endOffset       9       13


On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:

Hmm - I'm not sure about that; see
https://issues.apache.org/jira/browse/SOLR-2119

On 07/25/2011 12:01 PM, Markus Jelsma wrote:

charFilters are executed first regardless of their position in the
analyzer.
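
A minimal sketch of what this means for schema.xml, assuming (as stated above) that Solr runs every charFilter before the tokenizer wherever it is declared. The factory names are the ones already used in this thread; the shortened filter list is only for illustration. Listing the charFilter first changes nothing functionally under that assumption, it just makes the written order match the execution order:

```xml
<!-- Illustrative sketch only: a trimmed-down index analyzer. -->
<analyzer type="index">
  <!-- Character filters operate on the raw text before it is tokenized. -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <!-- The tokenizer then splits the stripped text on whitespace. -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Token filters run last, in the order listed. -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

The analysis page of the Solr admin interface can be used to check which order is actually applied for a given field type.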

On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:

I think you need to list the charFilter earlier in the analysis chain,
before the tokenizer.  Probably Solr should tell you this...

-Mike

On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:

Sounds logical. I just changed it to the following, restarted, and
reindexed with commit:

            <fieldType name="text" class="solr.TextField"
                       positionIncrementGap="100" autoGeneratePhraseQueries="true">
                <analyzer type="index">
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
                            generateWordParts="1" generateNumberParts="1" catenateWords="1"
                            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
                </analyzer>
                <analyzer type="query">
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
                            generateWordParts="1" generateNumberParts="1" catenateWords="0"
                            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
                </analyzer>
            </fieldType>

Unfortunately that did not fix the error. There are still <h3> tags
inside the data. I believe there are fewer than before, but I
cannot prove that. The fact is, there are still HTML tags inside the data.

Any other ideas what the problem could be?





2011/7/25 Markus Jelsma<markus.jel...@openindex.io>

You have three analyzer elements; I wonder what that would do. You need
to add the char filter to the index-time analyzer.

On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:

Hi there,

I am trying to strip HTML tags from the data before adding the
documents to the index. To do that I altered schema.xml like this:

            <fieldType name="text" class="solr.TextField"
                       positionIncrementGap="100" autoGeneratePhraseQueries="true">
                <analyzer type="index">
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
                            generateWordParts="1" generateNumberParts="1" catenateWords="1"
                            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                </analyzer>
                <analyzer type="query">
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
                            generateWordParts="1" generateNumberParts="1" catenateWords="0"
                            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                </analyzer>
                <analyzer>
                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                </analyzer>
            </fieldType>

       <fields>
           <field name="text" type="text" indexed="true" stored="true"
                  required="false"/>
       </fields>

Unfortunately this does not work; HTML tags like <h3> are still
present after restarting and reindexing. I also tried
HTMLStripTransformer, but this did not work either.

Does anybody have an idea how to get this done? Thank you in advance for
any hint.

Merlin

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
