Re: Best way to match umlauts

Steve Rowe Thu, 13 Jun 2013 21:51:08 -0700

Aditya,

Char filters are applied prior to tokenization, so they can affect 
tokenization, but I can't think of any tokenization changes that accent 
stripping would cause.

Token filters can be re-ordered to achieve certain objectives.  For example, if 
you want to use a stemmer that only recognizes lowercase terms, you could put a 
lowercasing filter in front of it.
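
As a rough illustration of why token filter order matters, here is a minimal Python sketch.  It is a toy model, not Lucene/Solr code: lowercase_filter, toy_stem_filter and run_filters are hypothetical names, and the "stemmer" only strips a trailing lowercase "ing".

```python
def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def toy_stem_filter(tokens):
    """Hypothetical stemmer that only recognizes a lowercase 'ing' suffix."""
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]


def run_filters(tokens, filters):
    """Apply token filters one after another, in the order given."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = ["Indexing", "SEARCHING"]

# Lowercasing in front of the stemmer: both terms get stemmed.
lower_first = run_filters(tokens, [lowercase_filter, toy_stem_filter])

# Stemmer first: it does not recognize the uppercase "ING" suffix.
stem_first = run_filters(tokens, [toy_stem_filter, lowercase_filter])

print(lower_first)  # ['index', 'search']
print(stem_first)   # ['index', 'searching']
```

With the lowercasing filter first, both terms are reduced; with the stemmer first, the all-uppercase term passes through unstemmed.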

In your case, if you use a char filter to do the accent stripping and you also 
use a stemmer, you won't be able to place the accent stripping after stemming, 
because char filters always precede tokenization, which always precedes token 
filtering.  Stripping accents before stemming could be a problem, though, if 
your stemmer relies on correctly accented words in order to work; in that case, 
you'd want to use a token filter (such as the ASCIIFolding filter you mention) 
to do the accent stripping instead, and place it after your stemmer.
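
The ordering constraint can be sketched with a toy Python pipeline.  Everything here is an assumption for illustration: analyze, strip_accents and toy_stem are hypothetical names, the tokenizer is a plain whitespace split, the accent stripping uses Unicode decomposition rather than the mappings the real Solr filters apply, and the stemming rule is invented purely to be accent-sensitive.

```python
import unicodedata

A_UMLAUT = "\u00e4"  # precomposed "a with diaeresis"


def strip_accents(text):
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(token):
    """Hypothetical stemmer rule that only fires on the accented form."""
    suffix = A_UMLAUT + "user"
    if token.endswith(suffix):
        return token[: -len(suffix)] + "aus"
    return token


def analyze(text, char_filters=(), token_filters=()):
    """Char filters run on raw text, then tokenization, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = text.split()  # stand-in for a real tokenizer
    for token_filter in token_filters:
        tokens = [token_filter(t) for t in tokens]
    return tokens


word = "H" + A_UMLAUT + "user"

# Accent stripping as a char filter: it necessarily runs before the stemmer.
as_char_filter = analyze(
    word, char_filters=[strip_accents], token_filters=[str.lower, toy_stem]
)

# Accent stripping as a token filter: it can be placed after the stemmer.
as_token_filter = analyze(
    word, token_filters=[str.lower, toy_stem, strip_accents]
)

print(as_char_filter)   # ['hauser']
print(as_token_filter)  # ['haus']
```

In the first configuration the stemmer never sees the accented form, so its rule does not fire; in the second, stemming happens first and the accent is removed afterwards.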

There may be other reasons you'd want to choose one over the other that I'm not 
thinking of, but primarily it's about choosing processing order to affect 
further stages in the pipeline.  If you don't think order matters, then you 
should be fine choosing either one. 

Steve

On Jun 13, 2013, at 8:17 PM, adityab <aditya_ba...@yahoo.com> wrote:

> this might be a dumb question. But can you please point me some key
> difference between ASCIIFolding Filter  and Character Filter using a map
> File.  
> thanks
> Aditya 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256p4070398.html
> Sent from the Solr - User mailing list archive at Nabble.com.
