Hello!

I have some encoding problems with my new analyzer
configuration. I have deployed a Solr server in Apache Tomcat, with
Tomcat's encoding set to UTF-8 in server.xml. Solr's encoding is also set to
UTF-8 in schema.xml. I have defined a fieldType like the following:

    <fieldType name="textSearch" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="charsToRemove.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_es.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="1"
                splitOnNumerics="1"
                stemEnglishPossessive="1"
                generateWordParts="1"
                generateNumberParts="1"
                preserveOriginal="1"
        />
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


I don't know why, but Solr immediately turns an input like "sueños" (dreams,
in English) into something like "sueÃ±os". As a result,
WordDelimiterFilterFactory splits the token into "sue Ã os", which obviously
affects search queries that include the original "sueños"
term. It looks as if Solr's encoding isn't UTF-8.
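
The symptom can be reproduced outside Solr. The following is a minimal Python sketch, not Solr code, assuming that the UTF-8 bytes of the input are decoded as ISO-8859-1 somewhere between the client and the analyzer. The helper names `misdecode` and `split_like_word_delimiter` are hypothetical, and the second one only roughly approximates two WordDelimiterFilterFactory rules (split on non-alphanumeric characters and on lower-to-upper case changes):

```python
def misdecode(text):
    """Encode as UTF-8, then wrongly decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def split_like_word_delimiter(token):
    """Rough, hypothetical approximation of word-delimiter splitting:
    break on non-alphanumeric characters and on lower-to-upper case changes."""
    parts, current = [], ""
    for ch in token:
        if not ch.isalnum():
            # A delimiter character ends the current part and is dropped.
            if current:
                parts.append(current)
            current = ""
            continue
        if current and current[-1].islower() and ch.isupper():
            # A case change (e.g. "e" followed by an uppercase letter) starts a new part.
            parts.append(current)
            current = ""
        current += ch
    if current:
        parts.append(current)
    return parts


original = "sue\u00f1os"        # "sueños", with ñ written as U+00F1
misread = misdecode(original)   # ñ becomes two characters: U+00C3 and U+00B1

print(misread)
print(split_like_word_delimiter(misread))
print(split_like_word_delimiter(original))
```

If the output matches what the analysis page shows, the analyzer chain itself is probably fine, and the wrong decoding more likely happens earlier, for example in the HTTP request handling or in the client that sends the text.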

Any tips or suggestions?

Thank you very much.

-- 

- Luis Cappa
