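An analyzer has exactly one tokenizer, and it runs first; the filters are then applied to its output. So the n-gram tokenizer can never receive the result of the stop and word-delimiter filters, which is what analysis.jsp is showing you. This might all be a bit clearer if you read the chapter about Analyzers in Lucene in Action, if you have a copy. If you try to break down that "the result of all this passed to" step into something more concrete, I think you will see how things (should) work.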
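To make the ordering concrete, here is a rough sketch in plain Python (not Solr or Lucene code) of a chain with one tokenizer followed by filters. All function names are made up for illustration, the stop word set is a hypothetical stand-in for stopwords.txt, and the n-gram behaviour is a simplified model of edge n-grams rather than the real implementation. It compares the chain from analysis.jsp (n-gram step as the tokenizer) with the chain the question expected (n-gram step applied last, per token, which assumes it is available as a filter).

```python
STOP_WORDS = {"the", "a", "an"}  # hypothetical stand-in for stopwords.txt


def whitespace_tokenizer(text):
    """A simple tokenizer: raw text in, initial token stream out."""
    return text.split()


def edge_ngram_tokenizer(text, min_gram=2, max_gram=15):
    """Simplified model: leading-edge n-grams of the raw, unfiltered text."""
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    # Drops only tokens that are exactly a stop word.
    return [t for t in tokens if t not in STOP_WORDS]


def edge_ngram_filter(tokens, min_gram=2, max_gram=15):
    """Simplified model: leading-edge n-grams of each incoming token."""
    grams = []
    for t in tokens:
        for n in range(min_gram, min(max_gram, len(t)) + 1):
            grams.append(t[:n])
    return grams


def analyze(text, tokenizer, filters):
    """One tokenizer runs first; filters then run in order on its output."""
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens


# What analysis.jsp reports: the n-gram tokenizer sees the raw text first.
observed = analyze("The Veronicas", edge_ngram_tokenizer,
                   [lowercase_filter, stop_filter])

# What the question expected: stop words removed before n-gramming.
expected = analyze("The Veronicas", whitespace_tokenizer,
                   [lowercase_filter, stop_filter, edge_ngram_filter])

print(observed)
print(expected)
```

In the first chain the stop filter only ever sees grams such as "th" or "the v", so it removes almost nothing; in the second, only grams of "veronicas" come out.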
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Norberto Meijome <[EMAIL PROTECTED]>
> To: SOLR-Usr-ML <solr-user@lucene.apache.org>
> Sent: Tuesday, June 24, 2008 3:19:09 AM
> Subject: (Edge)NGram tokenizer interaction with other filters
>
> hi everyone,
>
> if I define a field as
>
> positionIncrementGap="100">
> words="stopwords.txt" enablePositionIncrements="true"/>
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="1"/>
> class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
> minGramSize="2" maxGramSize="15"/>
>
> words="stopwords.txt"/>
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
> class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
> minGramSize="2" maxGramSize="15"/>
>
> I would expect that, when pushing data into it, this is what would happen:
> - Stop words removed by StopFilterFactory
> - content broken into several 'words' as per WordDelimiterFilterFactory.
> - the result of all this passed to EdgeNGram (or nGram) tokenizer
>
> so, when indexing 'The Veronicas', only 'Veronicas' would reach the NGram
> tokenizer....
>
> What I find is that the n-gram tokenizers kick in first, and the filters
> after, making it a rather moot exercise. I've confirmed the steps in
> analysis.jsp :
>
> Index Analyzer
> org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=15, minGramSize=2}
> [..]
> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> ignoreCase=true, enablePositionIncrements=true}
> [..]
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> [...]
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> [...]
>
> What am I doing / understanding wrong?
>
> thanks!!
> B
> _________________________
> {Beto|Norberto|Numard} Meijome
>
> Windows caters to everyone as though they are idiots. UNIX makes no such
> assumption. It assumes you know what you are doing, and presents the
> challenge of figuring it out for yourself if you don't.
>
> I speak for myself, not my employer. Contents may be hot. Slippery when wet.
> Reading disclaimers makes you go blind. Writing them is worse. You have been
> Warned.