I think I have solved my problem by using

        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    (to strip HTML tags and split on whitespace)
    and

        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-zA-Z0-9])" replacement="" replace="all" />

    (to remove all non-alphanumeric characters)
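
Put together, the analyzer might look roughly like the sketch below. The
field type name "text_html" is made up for illustration, and the
LengthFilterFactory line is an assumption on my part: it is only there to
drop tokens that end up empty once every character in them has been removed
(the max value of 255 is an arbitrary upper bound).

        <fieldType name="text_html" class="solr.TextField">
          <analyzer>
            <!-- strips HTML markup, then splits the text on whitespace -->
            <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
            <!-- removes every non-alphanumeric character inside each token -->
            <filter class="solr.PatternReplaceFilterFactory"
                    pattern="([^a-zA-Z0-9])" replacement="" replace="all"/>
            <!-- assumption: discard tokens left empty by the replacement -->
            <filter class="solr.LengthFilterFactory" min="1" max="255"/>
          </analyzer>
        </fieldType>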

Is that right?





________________________________
From: Antonio Zippo <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, 28 November 2008, 17:27:30
Subject: PatternReplaceFilterFactory and html tag

Hi all,

I have a text field that contains some HTML code,
e.g. "blablabla <p>hi this is a paragraph</p> aaaa bbb"

I need to keep these tags out of the index and the query, so I think I need
to use a PatternReplaceFilterFactory.

This filter removes every character other than a-zA-Z0-9 (so I can drop
punctuation, etc.):

<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-zA-Z0-9])" 
replacement="" replace="all" />

But I also need to add a replacement for "<p>", "</p>", "<br/>", "<br />", etc.

Could anyone help me find the right pattern?

Thanks in advance,
Zippo


      
