Spanish Stemmer

Darien Rosa Tue, 18 Aug 2009 20:26:10 -0700

Hello,

I am trying to configure Solr to index Spanish documents, and I have run into 
some problems with the Spanish stemmer. I have a basic install using Tomcat.


I suspect that the Spanish stemmer isn't working correctly. The site 
http://snowball.tartarus.org/algorithms/spanish/stemmer.html shows a sample of 
Spanish vocabulary together with the stemmed forms that the algorithm should 
produce. I tried several of those words and did not get the same results.

For example, the site says that the term "chicas" is stemmed to "chic". 
However, in my project the term "chicas" is stemmed to "chica" (I can see this 
using Luke, the Lucene Index Toolbox). I can't figure out where the problem is.

Here is a fragment of my schema.xml file:
<analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.ISOLatin1AccentFilterFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="stopwords.txt" enablePositionIncrements="true"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
               generateNumberParts="1" catenateWords="1" catenateNumbers="1"
               catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory" languange="Spanish"/>
</analyzer>
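
Note that the last filter line in the fragment spells its attribute 
"languange". For comparison, here is a minimal sketch of that one line with the 
attribute spelled "language". It rests on two assumptions that should be checked 
against the installed Solr version: that SnowballPorterFilterFactory reads the 
stemmer name from an attribute called "language", and that it silently falls 
back to its default (English) stemmer when that attribute is missing. An English 
stemmer stripping only the final "s" would be consistent with "chicas" becoming 
"chica".

```xml
<!-- Sketch only: same filter as in the fragment, attribute name spelled "language".
     Assumes the factory ignores unknown attributes and defaults to English. -->
<filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
```

If that is the cause, the documents would need to be reindexed after the change, 
and any analyzer of type "query" for the same field would need the same filter 
so that query terms are stemmed the same way.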


If anyone can give me any information about this, I would be very grateful.



Thanks in advance,



Darien


________________________________
Universidad Central "Marta Abreu" de Las Villas. http://www.uclv.edu.cu

