CharFilters run before the Tokenizer, which runs before the (token) Filters. Token filters (written as plain <filter> in the config) operate on tokens, so they have to come after the tokenizer. WhitespaceTokenizer is a tokenizer; PatternReplaceFilterFactory is a token filter.

What you probably want instead is solr.PatternReplaceCharFilterFactory, which you include with a <charFilter> (not <filter>) element, and which, as a char filter, goes before the tokenizer. Both are regexp replacers, but the char filter version operates on the character stream before it has been tokenized, while the token filter version operates on individual tokens.

It is a bit confusing because they are both regexp replacers. It becomes clearer when you consider token filters like stemmers, which obviously need to operate on tokens and thus have to go after the tokenizer.

Jonathan

________________________________________
From: markus.rietz...@rzf.fin-nrw.de [markus.rietz...@rzf.fin-nrw.de]
Sent: Tuesday, September 14, 2010 7:37 AM
To: solr-user@lucene.apache.org
Subject: order of analyzers, tokeinizers and filters

hi,
it's the second time i have stumbled across some strange behaviour:

in my schema.xml i have defined

  <fieldType name="textspell" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <!-- sg324 inkl. HTMLStrip... -->
      <charFilter class="solr.HTMLStripCharFilterFactory" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="/" replacement=" / " replace="all"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_spelling.txt" enablePositionIncrements="true" />
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>

i can't place the PatternReplaceFilter before the WhitespaceTokenizer. i have the schema like above and did a reload of my core, but when i go to the analysis page in the admin i can see that the WhitespaceTokenizer is executed before the PatternReplaceFilter.

is there a general order of execution?

markus
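
For reference, a sketch of how the index analyzer quoted above might look with the suggested change applied. Everything is copied from the original post except the pattern replacement, which is moved into a <charFilter> using solr.PatternReplaceCharFilterFactory so that it runs on the character stream before the WhitespaceTokenizer. It assumes the char filter accepts the same pattern and replacement attributes; the replace="all" attribute is dropped on the assumption that the char filter version replaces every match. It is untested against this particular schema, so check the result on the admin analysis page.

```xml
<fieldType name="textspell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- char filters: run first, on the raw character stream -->
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <!-- assumption: was a token <filter>; now a char filter so "/" is padded with spaces before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="/" replacement=" / "/>
    <!-- tokenizer: splits the (already rewritten) text on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- token filters: run on the tokens, in the order listed -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_spelling.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this layout an input such as "a/b" should reach the tokenizer as "a / b" and be split into three tokens, which appears to be what the original token filter was meant to achieve.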