RE: Russian stopwords

tushar kapoor Sat, 06 Dec 2008 01:17:18 -0800

Hi Steve,

You were right, it turned out to be an encoding issue, but a really weird
one. I was using Windows Notepad to save the stopwords file in UTF-8
encoding, whereas I was using EditPlus to save the synonyms file. That
was the only difference. The moment I switched to EditPlus for saving the
stopwords file, it started working for Russian, German, and all other
languages.
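
A possible explanation, not confirmed anywhere in this thread: Windows
Notepad of that era prepends a byte order mark (the bytes EF BB BF) to
files it saves as UTF-8. If the reader does not strip the mark, it
becomes part of the first entry in the file. The Python sketch below
shows how one might check a file's bytes for this. The function names and
the sample data are made up for illustration and are not part of Solr.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string begins with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def first_line(data: bytes, encoding: str) -> str:
    """Decode the bytes with the given codec and return the first line."""
    return data.decode(encoding).splitlines()[0]


# Hypothetical stopwords file contents: a BOM followed by two Russian words.
sample = codecs.BOM_UTF8 + "и\nв\n".encode("utf-8")

print(starts_with_utf8_bom(sample))             # the mark is present
print(first_line(sample, "utf-8") == "и")       # plain utf-8 keeps U+FEFF in the word
print(first_line(sample, "utf-8-sig") == "и")   # utf-8-sig strips the mark
```

To inspect a real file, read it in binary mode (for example
`open("stopwords.txt", "rb").read(3)`) and pass the bytes to
`starts_with_utf8_bom`.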


Anyway, thanks for suggesting a valid direction.

Regards,
Tushar.


Steven A Rowe wrote:
> 
> Hi Tushar,
> 
> On 12/05/2008 at 5:18 AM, tushar kapoor wrote:
>> I am trying to filter Russian stopwords but have not been
>> successful with that.
> [...]
>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>              words="stopwords.txt"/>
>>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>              ignoreCase="true" expand="false"/>
> [...]
>> Interestingly, Russian synonyms are working fine. English and Russian
>> synonyms get searched correctly.
>>
>> Also, if I add an English language word to stopwords.txt it
>> gets filtered correctly. It's the Russian words that are not
>> getting filtered as stopwords.
> 
> It might be an encoding issue - StopFilterFactory delegates stopword file
> reading to SolrResourceLoader.getLines(), which uses an InputStreamReader
> instantiated with the UTF-8 charset.  Is your stopwords.txt encoded as
> UTF-8?
> 
> It's strange that synonyms are working fine, though - SynonymFilterFactory
> reads in the synonyms file using the same mechanism as StopFilterFactory -
> is it possible that your synonyms file is encoded as UTF-8, but your
> stopwords file is encoded with a different encoding, perhaps KOI8-R?  Like
> UTF-8, KOI8-R includes the entirety of 7-bit ASCII, so English words would
> be properly decoded under UTF-8.
> 
> Steve
> 
> 

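Steve's point about 7-bit ASCII can be illustrated with a short Python
sketch. It encodes a word in a given file encoding and then decodes the
bytes as UTF-8, roughly what happens when a reader assumes UTF-8 for a
file saved in something else. The helper name is hypothetical, and
errors="replace" is only an approximation of how a Java reader handles
malformed input.

```python
def decode_as_utf8(word: str, file_encoding: str) -> str:
    """Encode the word as the file would store it, then read it back as UTF-8."""
    raw = word.encode(file_encoding)
    # Substitute U+FFFD for byte sequences that are not valid UTF-8.
    return raw.decode("utf-8", errors="replace")


# ASCII bytes are identical in KOI8-R and UTF-8, so English survives.
print(decode_as_utf8("the", "koi8_r") == "the")

# A Cyrillic letter stored as KOI8-R is not valid UTF-8, so it never
# matches the word the analyzer sees at query or index time.
print(decode_as_utf8("и", "koi8_r") == "и")

# The same letter stored as UTF-8 round-trips cleanly.
print(decode_as_utf8("и", "utf-8") == "и")
```
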
-- 
View this message in context: 
http://www.nabble.com/Russian-stopwords-tp20851093p20868126.html
Sent from the Solr - User mailing list archive at Nabble.com.
