An analyzer consists of a single tokenizer followed by filters, so the tokenizer always runs first, no matter where you declare it in the chain. I think this will all be a bit clearer if you read the chapter about analyzers in Lucene in Action, if you have a copy. If you try to break down that "the result of all this passed to ..." into something more concrete and real, you will see how things (should) work.
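
For example, here is a minimal sketch (untested, and assuming your Solr build includes solr.EdgeNGramFilterFactory; the field type name is just a placeholder): declare the tokenizer first and do the n-gramming in a filter, so the stop word and word delimiter filters run before the n-grams are generated:

  <fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="1" catenateWords="1"
              catenateNumbers="1" catenateAll="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- n-gramming happens here, as a filter, so it only sees tokens that
           have already been stopped, split, and lower-cased -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

With a chain like that, indexing 'The Veronicas' should drop 'The' at the stop filter, and only 've', 'ver', ... 'veronicas' would come out of the edge n-gram step.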


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Norberto Meijome <[EMAIL PROTECTED]>
> To: SOLR-Usr-ML <solr-user@lucene.apache.org>
> Sent: Tuesday, June 24, 2008 3:19:09 AM
> Subject: (Edge)NGram tokenizer interaction with other filters
> 
> hi everyone,
> 
> 
> if I define a field as 
> 
>     <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>         <analyzer type="index">
>             <filter class="solr.StopFilterFactory" ignoreCase="true"
>                     words="stopwords.txt" enablePositionIncrements="true"/>
>             <filter class="solr.WordDelimiterFilterFactory"
>                     generateWordParts="1" generateNumberParts="1" catenateWords="1"
>                     catenateNumbers="1" catenateAll="1"/>
>             <tokenizer
>                     class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
>                     minGramSize="2" maxGramSize="15"/>
>             <filter class="solr.LowerCaseFilterFactory"/>
>             <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>         </analyzer>
>         <analyzer type="query">
>             <filter class="solr.StopFilterFactory" ignoreCase="true"
>                     words="stopwords.txt"/>
>             <filter class="solr.WordDelimiterFilterFactory"
>                     generateWordParts="1" generateNumberParts="1" catenateWords="0"
>                     catenateNumbers="0" catenateAll="0"/>
>             <tokenizer
>                     class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
>                     minGramSize="2" maxGramSize="15"/>
>             <filter class="solr.LowerCaseFilterFactory"/>
>             <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>         </analyzer>
>     </fieldType>
> 
> I would expect that, when pushing data into it, this is what would happen:
> - Stop words removed by StopFilterFactory
> - content broken into several 'words' as per WordDelimiterFilterFactory.
> - the result of all this passed to EdgeNGram (or nGram) tokenizer
> 
> so, when indexing 'The Veronicas', only 'Veronicas' would reach the NGram 
> tokenizer....
> 
> What I find is that the n-gram tokenizer kicks in first and the filters run 
> afterwards, making it a rather moot exercise. I've confirmed the steps in 
> analysis.jsp:
> 
> Index Analyzer
> org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=15, minGramSize=2}
> [..]
> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, 
> ignoreCase=true, enablePositionIncrements=true}
> [..]
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> [...]
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> [...]
> 
> What am I doing / understanding wrong? 
> 
> thanks!!
> B
> _________________________
> {Beto|Norberto|Numard} Meijome
> 
> Windows caters to everyone as though they are idiots. UNIX makes no such 
> assumption. It assumes you know what you are doing, and presents the 
> challenge 
> of figuring  it out for yourself if you don't.
> 
> I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
> Reading disclaimers makes you go blind. Writing them is worse. You have been 
> Warned.
