All,

I am having trouble getting my regex pattern to work properly. I have tried
PatternReplaceFilterFactory after the standard tokenizer:

<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9])"
replacement=" " replace="all"/>

and PatternReplaceCharFilterFactory before it:

<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([^a-zA-Z0-9])" replacement=" " replace="all"/>

Either one looks like it should remove everything except letters and
numbers. My current analyzer chain is:

        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>

        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_en.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LengthFilterFactory" min="2" max="999"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z0-9])" replacement=" " replace="all"/>

I am still left with quite a few facet values like these:

<int name="_ view">1443</int>
<int name="view _">1599</int>

Can anyone suggest what may be going on here? I have verified that my regex
works properly using the tester at http://www.fileformat.info/tool/regex.htm
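
As a sanity check outside Solr, the token-filter pattern can also be exercised
directly with java.util.regex, which is the regex engine Solr's pattern
filters are built on. The following is only a minimal sketch: the class name,
the clean() helper, and the sample tokens are made up for illustration, and it
shows what the pattern does to a single string, not what the full analyzer
chain does. Note that with replacement=" " each non-matching character is
turned into a space rather than deleted.

```java
import java.util.regex.Pattern;

public class PatternReplaceSketch {
    // Same character class as the PatternReplaceFilterFactory pattern above.
    private static final Pattern NON_ALNUM = Pattern.compile("([^a-z0-9])");

    // Roughly what replace="all" with replacement=" " does to one token.
    static String clean(String token) {
        return NON_ALNUM.matcher(token).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Hypothetical sample tokens, not taken from the real index.
        for (String token : new String[] {"view_", "_view", "view"}) {
            // Brackets make leading and trailing spaces visible.
            System.out.println("[" + clean(token) + "]");
        }
    }
}
```

In this sketch the underscore becomes a space, so the result still carries
whitespace at the start or end of the string.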

Adam
