RE: order of analyzers, tokenizers and filters

Jonathan Rochkind Tue, 14 Sep 2010 06:03:19 -0700

CharFilters go before Tokenizers which go before (token) Filters.  

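Token filters (called just <filter> in the config) operate on tokens, so they 
need to go after the tokenizer, which is why the admin analysis page shows the 
tokenizer running first even though your <filter> is listed above it. 
WhitespaceTokenizerFactory is a tokenizer. 
PatternReplaceFilterFactory is a token filter.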


What you probably want instead is solr.PatternReplaceCharFilterFactory, which 
you can include with a <charFilter> (not <filter>) element, and which as a char 
filter goes before the tokenizer. 
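
Something like the following is roughly what I mean: a sketch of your index 
analyzer with the slash replacement moved into a char filter. The 
pattern/replacement attribute names are an assumption taken from recent 
releases (I believe the 1.4 release spelled the second one replaceWith), so 
check the docs for your version. Everything else is copied from your config.

```xml
<analyzer type="index">
  <!-- char filters run first, on the raw character stream -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <!-- assumed attribute names; older releases may use replaceWith instead of replacement -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="/" replacement=" / "/>
  <!-- the tokenizer runs after all char filters -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- token filters run last, in the order listed -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords_spelling.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

With an input like foo/bar, the char filter should hand "foo / bar" to the 
WhitespaceTokenizer, which then emits foo, / and bar as separate tokens. With 
the token filter version, the tokenizer sees foo/bar as one token and the 
replacement only rewrites the text inside that single token.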

Both are regexp replacers, but the char filter version operates on the 
character stream before it has been tokenized, while the token filter version 
operates on individual tokens.  It's a bit confusing because both do regexp 
replacement; the distinction is clearer when you consider token filters like 
stemmers, which obviously need to operate on tokens and thus have to go after 
the tokenizer. 

Jonathan 
________________________________________
From: markus.rietz...@rzf.fin-nrw.de [markus.rietz...@rzf.fin-nrw.de]
Sent: Tuesday, September 14, 2010 7:37 AM
To: solr-user@lucene.apache.org
Subject: order of analyzers, tokenizers and filters

hi,
it's the second time i have stumbled across some strange behaviour:

in my schema.xml i have defined

    <fieldType name="textspell" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <!-- sg324 inkl. HTMLStrip... -->
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="/"
replacement=" / " replace="all"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_spelling.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

i can't place the PatternReplaceFilter before the WhitespaceTokenizer. i
have the schema like above and did a reload of my core, but
when i go to the analysis page in the admin i can see that the
WhitespaceTokenizer is executed before the PatternReplaceFilter.

is there a general order of execution?

markus
