: http://projecte01.development.barcelonamedia.org/fonetic/
: you will see a "Top Words" list (in Spanish and stemmed) in the list there
: is the word "si" which is in 20649 documents.
: If you click at this word, the system will perform the query
: (x) content:si, with no answers at all
: The same for "la" it is in 17881 documents, but the query "content:la" will
: give no answers at all ...
: To see what's going on on the index I have tested with the analyzer
: http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp ...
: "las cosas que si no pasan la proxima vez si que no veràs"

But are you sure that example would actually cause a problem? I suspect that if you index that exact sentence as is, you wouldn't see the facet count for "si" or "que" increase at all.

If you do a query for "{!raw field=content}que", you bypass the query parser (which is respecting your stopwords file) and see all docs that contain the raw term "que" in the content field. If you look at some of the docs that match and paste their content field into the analysis tool, I think you'll see that the problem comes from using the whitespace tokenizer, and is masked by using the WDF (WordDelimiterFilter) after the stop filter: things like "Que?" are getting ignored by the stop filter, but ultimately winding up in your index as "que".

-Hoss
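
To make the suspected failure mode concrete, here is a rough Python simulation of that filter ordering. It is only a sketch, not Solr's actual classes: the stopword set, the function names, and the simplified word-delimiter behaviour (split on punctuation, then lowercase) are all assumptions made for illustration.

```python
import re

# Hypothetical stopword list; the real one lives in the stopwords file.
STOPWORDS = {"si", "que", "la", "las", "no"}

def whitespace_tokenize(text):
    # Split on whitespace only, so punctuation stays glued to the word.
    return text.split()

def stop_filter(tokens):
    # Drop tokens that exactly match a stopword (case-insensitive here).
    return [t for t in tokens if t.lower() not in STOPWORDS]

def word_delimiter(tokens):
    # Simplified stand-in for the WDF: keep only runs of letters/digits,
    # then lowercase. The real filter has many more options.
    out = []
    for t in tokens:
        out.extend(part.lower() for part in re.findall(r"[^\W_]+", t))
    return out

def analyze(text):
    # Same order as described above: tokenizer, stop filter, then WDF.
    return word_delimiter(stop_filter(whitespace_tokenize(text)))

if __name__ == "__main__":
    print(analyze("las cosas que si no pasan"))
    print(analyze("Que? no pasan"))
```

In this sketch the plain "que" is removed by the stop filter, but "Que?" does not match any stopword, survives, and is only afterwards reduced to "que" by the delimiter step. That would explain a term that shows up in the facet counts yet can never be matched by a query that goes through the same stop filter.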