Re: Standard analyzer and acronyms

Otis Gospodnetic Mon, 22 Sep 2008 07:27:23 -0700

Hi,

Are you sure you are not looking at the original field values? (Which 
schema browser are you referring to?)
Yes, tokenizer + filters are applied in the order they are defined in, so the 
order is important.  For example, you typically want to lower-case tokens 
before removing stop words because, presumably, your stop words are all 
lower-case.
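
To make the ordering point concrete, here is a minimal sketch in plain Python. 
It does not use the Lucene/Solr API: lowercase_filter, stop_filter, analyze, 
the stop word list and the sample text are all made up for illustration, and 
the whitespace split only stands in for a real tokenizer.

```python
# Hypothetical stand-ins for an analysis chain; this is NOT the Lucene/Solr API.
STOP_WORDS = {"the", "a", "of"}  # assumed to be all lower-case


def lowercase_filter(tokens):
    """Lower-case every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Drop tokens that exactly match a stop word (case-sensitive)."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, filters):
    """Split on whitespace, then apply the filters in the order given."""
    tokens = text.split()
    for f in filters:
        tokens = f(tokens)
    return tokens


text = "The Company of Milan"

# Lower-casing first lets "The" match the lower-case stop word "the".
lower_then_stop = analyze(text, [lowercase_filter, stop_filter])

# Stop filtering first misses "The", which is only lower-cased afterwards.
stop_then_lower = analyze(text, [stop_filter, lowercase_filter])

print(lower_then_stop)
print(stop_then_lower)
```

With the same two filters, only the first ordering removes "The", which is 
why the order of the definitions matters.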


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Luca Molteni <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Monday, September 22, 2008 4:43:43 AM
> Subject: Standard analyzer and acronyms
> 
> Hello, list.
> 
> I found some strange results using the standard analyzer.
> 
> I've put it in both query and index time, but when I use the schema browser
> to see the common values for the field, I find:
> 
> spa 1558
> s.p.a. 833
> Which is pretty strange, since I've used the analyzer to remove the dots
> from the acronyms.
> 
> My hypothesis is that the StandardAnalyzer removes dots only from
> uppercase acronyms.
> 
> Can anyone confirm this to me?
> 
> Regarding this, I was wondering if the filters and the tokenizers are applied
> sequentially, in the order in which they are written.
> For example, if I use the StandardAnalyzer, the StopFilter for the word
> "IBM" and the whitespace tokenizer
> 
> "I.B.M Company"
> 
> 1. The standard removes the dot
> 
> "IBM Company"
> 
> 2. The stopfilter removes the word "IBM"
> 
> "Company"
> 
> 3. The analyzer returns only one token
> 
> "Company".
> 
> I know, this is not a great example, but I think that not all the analyzers
> are commutative, so there should be an order in which they are applied.
> 
> Thank you very much.
> 
> L.M.
