(Edge)NGram tokenizer interaction with other filters

Norberto Meijome Tue, 24 Jun 2008 00:19:47 -0700

Hi everyone,

If I define a field type as:

        <fieldType name="subword" class="solr.TextField"
         positionIncrementGap="100">
            <analyzer type="index">
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                 words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.WordDelimiterFilterFactory"
                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
                 catenateNumbers="1" catenateAll="1"/>
                <tokenizer
                 class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
                 minGramSize="2" maxGramSize="15"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                 words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                 generateWordParts="1" generateNumberParts="1" catenateWords="0"
                 catenateNumbers="0" catenateAll="0"/>
                <tokenizer
                 class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
                 minGramSize="2" maxGramSize="15"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>
        
I would expect the following to happen, in this order, when data is indexed
into it:
 - stop words are removed by StopFilterFactory
 - the content is broken into several 'words' as per WordDelimiterFilterFactory
 - the result of all this is passed to the EdgeNGram (or NGram) tokenizer

So, when indexing 'The Veronicas', only 'Veronicas' would reach the NGram
tokenizer.

What I find is that the n-gram tokenizer kicks in first and the filters run
afterwards, which makes the stop word and word delimiter steps rather
pointless. I've confirmed the order of the steps in analysis.jsp:

Index Analyzer
org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=15, minGramSize=2}
[..]
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, 
ignoreCase=true, enablePositionIncrements=true}
[..]
org.apache.solr.analysis.LowerCaseFilterFactory {}
[...]
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
[...]
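
The listing above is consistent with Solr building each analyzer as exactly one
tokenizer followed by its filters, regardless of the order in which the
elements are declared. If so, a chain that matches the order expected above
might look like the sketch below. It is only a sketch: it assumes
solr.WhitespaceTokenizerFactory as the tokenizer and
solr.EdgeNGramFilterFactory (the filter counterpart of the n-gram tokenizer)
for the grams. Neither appears in the config above, and whether they are
available depends on the Solr version.

            <analyzer type="index">
                <!-- Assumption: split on whitespace first, so that the
                     filters below see whole words -->
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                 words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.WordDelimiterFilterFactory"
                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
                 catenateNumbers="1" catenateAll="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <!-- Assumption: the n-gram step is declared as a filter, so
                     that it runs after the filters above -->
                <filter class="solr.EdgeNGramFilterFactory"
                 minGramSize="2" maxGramSize="15"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>

With a chain like this, and assuming 'the' is listed in stopwords.txt, 'The
Veronicas' should lose 'The' at the stop filter, and only 'veronicas' should be
expanded into edge n-grams ('ve', 'ver', and so on).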

What am I doing or understanding wrong?

thanks!!
B
_________________________
{Beto|Norberto|Numard} Meijome

Windows caters to everyone as though they are idiots. UNIX makes no such 
assumption. It assumes you know what you are doing, and presents the challenge 
of figuring  it out for yourself if you don't.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.
