Re: facets and stopwords

JCodina Wed, 01 Jul 2009 01:45:20 -0700

Sorry, I was too cryptic.

If you follow this link

http://projecte01.development.barcelonamedia.org/fonetic/
you will see a "Top Words" list (in Spanish, stemmed). The list includes
the word "si", which is reported in 20649 documents.
If you click on this word, the system performs the query
      (x) content:si
which returns no results at all.
The same happens with "la": it is reported in 17881 documents, but the query
"content:la" returns no results either.

The facet list is generated by the query
http://projecte01.development.barcelonamedia.org/solr/select/?&rows=0&start=0&q=*:*&facet=true&facet.limit=-1&facet.field=content&facet.field=entities_misc&wt=json&json.wrf=jsonp1246437157825&jsoncallback=jsonp1246437157825&_=1246437158023
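
The two requests involved (the facet listing and the query issued when a term is clicked) can be sketched as below. This is only an illustration: the helper names and the `limit` parameter are hypothetical, and `pair_up` assumes the JSON response uses Solr's default flat `[term, count, term, count, ...]` layout for facet fields.

```python
from urllib.parse import urlencode

# Base URL taken from the post; change it for another Solr instance.
SOLR_SELECT = "http://projecte01.development.barcelonamedia.org/solr/select/"


def facet_url(field, limit=40):
    """Build a facet-only request (rows=0) for one field.

    `limit` is a hypothetical convenience; the original query used -1.
    """
    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


def term_query_url(field, term):
    """Build the query issued when a facet term is clicked, e.g. content:si."""
    params = {"q": f"{field}:{term}", "rows": 0, "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)


def pair_up(flat):
    """Turn a flat [term, count, term, count, ...] list into (term, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))
```

Fetching `facet_url("content")`, then `term_query_url("content", term)` for each listed term, and comparing the facet count with the `numFound` of the second response would show every term affected by the mismatch.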

But the question is: why are these two words (among others) there if they are
stop words?

To see what is going on in the index, I tested with the analyzer
http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp

If I select the field "content" and enter the text

"las cosas que si no pasan la proxima vez si que no veràs"
 
I get the following tokens at the end of the analysis chain:

las     cosa    pasan   proxima vez sí  verà

where que, si, no and la are removed because they are treated as stop words.
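
To make the expected behaviour explicit, here is a minimal sketch of only the whitespace tokenizing and stop-word removal steps. It is not Solr's implementation: the stop list contains just the four words named above (the real stopwords.txt is not shown in this post), and stemming is not modelled, so the result will not match the analyzer's stemmed output.

```python
# Assumed stop list: only the four words mentioned in this message.
STOP_WORDS = {"que", "si", "no", "la"}


def whitespace_then_stop(text, stop_words=STOP_WORDS, ignore_case=True):
    """Split on whitespace, then drop tokens found in the stop list."""
    kept = []
    for token in text.split():
        # With ignore_case the comparison is done on the lowercased token,
        # but the token itself is kept unchanged.
        key = token.lower() if ignore_case else token
        if key not in stop_words:
            kept.append(token)
    return kept
```

Calling `whitespace_then_stop("las cosas que si no pasan la proxima vez si que no veràs")` keeps only the tokens outside the stop list, which is what the index is expected to contain before stemming.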

But in the schema browser
http://projecte01.development.barcelonamedia.org/solr/admin/schema.jsp
for the field content, "que" is the 3rd most frequent term, "no" is the 4th,
and "si" and "la" are among the top 40 terms.
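
A quick way to list every affected term is to intersect the top terms with the stop list. The helper below is a hypothetical sketch: it assumes the top terms have already been copied out of the schema browser into a list ordered by frequency, and that the stop list has been read from stopwords.txt.

```python
def stop_words_among_top_terms(top_terms, stop_words):
    """Return (rank, term) pairs for top terms that are also stop words.

    Ranks start at 1 and follow the order of `top_terms`.
    """
    stop = {word.lower() for word in stop_words}
    return [
        (rank, term)
        for rank, term in enumerate(top_terms, start=1)
        if term.lower() in stop
    ]
```

An empty result would mean the index agrees with the analyzer; any pair returned is a stop word that still made it into the index.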

The analysis chain for the content field can be seen on this page and has the
following components:


Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

   1. org.apache.solr.analysis.StopFilterFactory
      args:{enablePositionIncrements: true words: stopwords.txt ignoreCase: true }
   2. org.apache.solr.analysis.WordDelimiterFilterFactory
      args:{catenateWords: 1 catenateNumbers: 1 splitOnCaseChange: 1 catenateAll: 0 generateNumberParts: 1 generateWordParts: 1 }
   3. org.apache.solr.analysis.LowerCaseFilterFactory
      args:{}
   4. org.apache.solr.analysis.SnowballPorterFilterFactory
      args:{languange: Spanish }
   5. org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
      args:{}

The field is indexed, tokenized and stored, and term vectors are stored.

So why are the stopwords in the index?




