RE: Russian stopwords

Lance Norskog Sat, 06 Dec 2008 14:46:47 -0800

The default encoding on Windows is not UTF-8, which causes all sorts of odd
behavior when you develop on Windows. On the plus side, it has helped me find
every place in my string-handling code that needs the encoding name
parameter, so it's not all bad.
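
To illustrate the encoding name parameter, here is a minimal Java sketch. The
class and method names are made up for this example and do not come from Solr.
It assumes a JDK with java.nio.charset.StandardCharsets (Java 7 or later). On
older JDKs the same idea applies with the charset passed by name, for example
getBytes("UTF-8").

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Solr.
class ExplicitCharsetDemo {

    // Encode with an explicitly named charset instead of the platform default.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with an explicitly named charset instead of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "\u0438"; // Cyrillic small letter i
        byte[] bytes = encodeUtf8(word);
        // The platform default is what the no-argument String/byte[] methods use.
        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 byte count: " + bytes.length);
        System.out.println("round trip ok: " + word.equals(decodeUtf8(bytes)));
    }
}
```

Calls such as text.getBytes() or new String(bytes) with no charset argument
are the ones that fall back to the platform default.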


Lance 

-----Original Message-----
From: tushar kapoor [mailto:[EMAIL PROTECTED] 
Sent: Saturday, December 06, 2008 1:17 AM
To: solr-user@lucene.apache.org
Subject: RE: Russian stopwords


Hi Steve,

You were right, it turned out to be an encoding issue, but a really weird
one. I was using Windows Notepad to save the stopwords file in UTF-8
encoding, whereas I was using EditPlus to save the synonyms file. That
was the only difference. The moment I switched to EditPlus for saving the
stopwords file, it started working for Russian, German, and all other
languages.
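
One way to compare what two editors actually wrote is to inspect the raw
bytes. The Java sketch below is only an assumption about how one might check
this. It was not run as part of this thread, and the class and method names
are hypothetical. It tests for the UTF-8 byte order mark (EF BB BF), which
Windows Notepad of that era prepended when saving as UTF-8, and checks whether
the bytes decode as strict UTF-8. Nothing here establishes that the byte
order mark was the actual cause of the problem.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of Solr.
class EncodingCheck {

    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // True if the data decodes as UTF-8 with no malformed byte sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java EncodingCheck <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("starts with UTF-8 BOM: " + hasUtf8Bom(data));
        System.out.println("valid UTF-8: " + isValidUtf8(data));
    }
}
```

Running it against both stopwords.txt and synonyms.txt would show whether the
two files differ at the byte level.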

Anyway, thanks for suggesting a valid direction.

Regards,
Tushar.


Steven A Rowe wrote:
> 
> Hi Tushar,
> 
> On 12/05/2008 at 5:18 AM, tushar kapoor wrote:
>> I am trying to filter Russian stopwords but have not been successful 
>> with that.
> [...]
>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>              words="stopwords.txt"/>
>>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>              ignoreCase="true" expand="false"/>
> [...]
>> Interestingly, Russian synonyms are working fine. English and Russian 
>> synonyms get searched correctly.
>>
>> Also, if I add an English language word to stopwords.txt it gets 
>> filtered correctly. It's the Russian words that are not getting 
>> filtered as stopwords.
> 
> It might be an encoding issue - StopFilterFactory delegates stopword 
> file reading to SolrResourceLoader.getLines(), which uses an 
> InputStreamReader instantiated with the UTF-8 charset.  Is your 
> stopwords.txt encoded as UTF-8?
> 
> It's strange that synonyms are working fine, though - 
> SynonymFilterFactory reads in the synonyms file using the same 
> mechanism as StopFilterFactory - is it possible that your synonyms 
> file is encoded as UTF-8, but your stopwords file is encoded with a 
> different encoding, perhaps KOI8-R?  Like UTF-8, KOI8-R includes the 
> entirety of 7-bit ASCII, so English words would be properly decoded 
> under UTF-8.
> 
> Steve
> 
> 
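
The point about KOI8-R and ASCII can be reproduced with a small Java sketch.
It is a standalone illustration and not Solr code. The class and method names
are made up, and it assumes the JDK in use provides the KOI8-R charset. It
encodes a word in one charset and then decodes the bytes as UTF-8, which
mimics a UTF-8 reader being handed a file saved in another encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Solr.
class Koi8VersusUtf8 {

    // Encode text with the charset the file was "saved as", then decode the
    // resulting bytes as UTF-8, the way a UTF-8 reader would see them.
    static String readAsUtf8(String text, Charset savedAs) {
        return new String(text.getBytes(savedAs), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset koi8r = Charset.forName("KOI8-R");
        String english = "the";
        String russian = "\u0438\u043b\u0438"; // three Cyrillic letters

        // ASCII bytes are identical in KOI8-R and UTF-8, so English survives.
        System.out.println("English intact: "
                + english.equals(readAsUtf8(english, koi8r)));
        // KOI8-R Cyrillic bytes are not valid UTF-8 for the same letters.
        System.out.println("Russian intact (KOI8-R file): "
                + russian.equals(readAsUtf8(russian, koi8r)));
        // The same word saved as UTF-8 decodes correctly.
        System.out.println("Russian intact (UTF-8 file): "
                + russian.equals(readAsUtf8(russian, StandardCharsets.UTF_8)));
    }
}
```

A stopword that comes out of the reader garbled can never match a token from
the indexed text, which would be consistent with English stopwords working
while Russian ones do not.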

--
View this message in context:
http://www.nabble.com/Russian-stopwords-tp20851093p20868126.html
Sent from the Solr - User mailing list archive at Nabble.com.
