: http://projecte01.development.barcelonamedia.org/fonetic/
: you will see a "Top Words" list (in Spanish and stemmed) in the list there
: is the word "si" which is in  20649 documents.
: If you click at this word, the system will perform the query 
:       (x) content:si, with no answers at all
: The same for "la" it is in 17881 documents, but the query  "content:la" will
: give no answers at all
        ...
: To see what's going on on the index I have tested with the analyzer
: http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp
        ...
: "las cosas que si no pasan la proxima vez si que no veràs"

but are you sure that example would actually cause a problem?
i suspect that if you index that exact sentence as is, you wouldn't see the 
facet count for "si" or "que" increase at all.

If you do a query for "{!raw f=content}que" you bypass the query 
parser (which is respecting your stopwords file) and see all docs that 
contain the raw term "que" in the content field.
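
A minimal sketch of building that request from a script, in case it is easier than typing it into the browser. The raw parser takes the field name in its "f" local param; the base URL and the helper name below are assumptions for illustration, so substitute your own select handler:

```python
from urllib.parse import urlencode

# Hypothetical select handler URL; replace with your own Solr instance.
SOLR_SELECT = "http://localhost:8983/solr/select"

def raw_term_query_url(field, term, rows=10):
    """Build a select URL that looks up `term` in `field` without analysis."""
    # {!raw f=<field>} hands the term to the index untouched, so neither the
    # stop filter nor any other query-time analysis gets a chance to drop it.
    params = {"q": "{!raw f=%s}%s" % (field, term), "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)

url = raw_term_query_url("content", "que")
```

Fetching that URL should list the docs whose content field really contains the indexed term "que".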

if you look at some of the docs that match, and paste their content field 
into the analysis tool, i think you'll see that the problem comes from 
using the whitespace tokenizer, and is masked by using the 
WordDelimiterFilter (WDF) after the stop filter ... the whitespace 
tokenizer leaves punctuation attached, so things like "Que?" don't match 
the stopword "que" and are ignored by the stop filter, but once the WDF 
strips the punctuation they ultimately wind up in your index as "que"
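
To make the ordering problem concrete, here is a toy simulation of that kind of chain. It is only a sketch: the stopword set is a made-up subset, the WDF stand-in just splits on non-word characters, and the position of the lowercasing step is an assumption, not something taken from your schema:

```python
import re

# Hypothetical subset of a Spanish stopwords file.
STOPWORDS = {"si", "que", "la", "las", "no"}

def whitespace_tokenize(text):
    # Splits on whitespace only, so punctuation stays glued to the word.
    return text.split()

def stop_filter(tokens):
    # Exact (case-insensitive) match against the stopword list.
    return [t for t in tokens if t.lower() not in STOPWORDS]

def word_delimiter_filter(tokens):
    # Crude stand-in for the WDF: split on non-word characters, drop empties.
    out = []
    for t in tokens:
        out.extend(p for p in re.split(r"[^\w]+", t) if p)
    return out

def lowercase(tokens):
    return [t.lower() for t in tokens]

def analyze(text):
    # Stop filter runs BEFORE the WDF, which is the ordering described above.
    tokens = whitespace_tokenize(text)
    tokens = stop_filter(tokens)
    tokens = word_delimiter_filter(tokens)
    return lowercase(tokens)

clean = analyze("las cosas que si no pasan")   # stopwords removed as expected
leaky = analyze("Que? no se")                  # "Que?" slips past the stop filter
```

In this sketch `clean` contains no stopwords, while `leaky` still contains "que", because the stop filter saw "Que?" and the punctuation was only removed afterwards.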


-Hoss
