hi everyone,
if I define a field as

  <fieldType name="subword" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
      <tokenizer class="org.apache.solr.analysis.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="15"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
      <tokenizer class="org.apache.solr.analysis.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="15"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

I would expect that, when pushing data into it, this is what would happen:

- Stop words removed by StopFilterFactory.
- Content broken into several 'words' as per WordDelimiterFilterFactory.
- The result of all this passed to the EdgeNGram (or NGram) tokenizer.

So, when indexing 'The Veronicas', only 'Veronicas' would reach the NGram tokenizer.

What I find is that the n-gram tokenizers kick in first, and the filters after, making it a rather moot exercise. I've confirmed the steps in analysis.jsp:

  Index Analyzer
  org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=15, minGramSize=2} [..]
  org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true} [..]
  org.apache.solr.analysis.LowerCaseFilterFactory {} [...]
  org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} [...]

What am I doing / understanding wrong?

thanks!!
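
For comparison, here is a minimal sketch of an index analyzer that should give the ordering described above. It is not a confirmed fix and rests on assumptions that are not in the config above: that an analyzer's tokenizer always runs before its filters, whatever order the elements are declared in (which would match the analysis.jsp output), and that the n-gram step can be moved into a token filter, solr.EdgeNGramFilterFactory, with solr.WhitespaceTokenizerFactory standing in as the tokenizer. The name subword_sketch is hypothetical, and whether EdgeNGramFilterFactory is available depends on the Solr version in use.

```xml
<!-- Hypothetical field type; only the index analyzer is sketched. -->
<fieldType name="subword_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Assumption: a plain tokenizer goes first, since the tokenizer
         runs before the filters regardless of declaration order. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Filters are then applied in the order they are listed. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
    <!-- Assumption: n-grams produced by a filter instead of a tokenizer,
         so they are built only from tokens that survived the steps above. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, 'The Veronicas' would be split on whitespace first, so the stop filter sees 'The' as a whole token before any n-grams are generated. The query analyzer would need the same reordering.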
B
_________________________
{Beto|Norberto|Numard} Meijome

Windows caters to everyone as though they are idiots. UNIX makes no such assumption. It assumes you know what you are doing, and presents the challenge of figuring it out for yourself if you don't.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.