The default encoding on Windows is not UTF-8. This causes various oddities when you develop on Windows. On the upside, it has helped me find every place in my string-handling code that needs an explicit encoding name parameter, so it's not all bad.

Lance

-----Original Message-----
From: tushar kapoor [mailto:[EMAIL PROTECTED]
Sent: Saturday, December 06, 2008 1:17 AM
To: solr-user@lucene.apache.org
Subject: RE: Russian stopwords

Hi Steve,

You were right, it turned out to be an encoding issue, but a really weird one. I was using Windows Notepad to save the stopwords file in UTF-8 encoding. On the other hand, I was using EditPlus to save the synonyms file. That was the only difference. The moment I switched to EditPlus for saving the stopwords file, it started working for Russian, German and all types of languages.

Anyway, thanks for suggesting a valid direction.

Regards,
Tushar.

Steven A Rowe wrote:
>
> Hi Tushar,
>
> On 12/05/2008 at 5:18 AM, tushar kapoor wrote:
>> I am trying to filter Russian stopwords but have not been successful
>> with that.
> [...]
>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>>         words="stopwords.txt"/>
>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>         ignoreCase="true" expand="false"/>
> [...]
>> Interestingly, Russian synonyms are working fine. English and Russian
>> synonyms get searched correctly.
>>
>> Also, if I add an English language word to stopwords.txt it gets
>> filtered correctly. It's the Russian words that are not getting
>> filtered as stopwords.
>
> It might be an encoding issue - StopFilterFactory delegates stopword
> file reading to SolrResourceLoader.getLines(), which uses an
> InputStreamReader instantiated with the UTF-8 charset. Is your
> stopwords.txt encoded as UTF-8?
>
> It's strange that synonyms are working fine, though -
> SynonymFilterFactory reads in the synonyms file using the same
> mechanism as StopFilterFactory - is it possible that your synonyms
> file is encoded as UTF-8, but your stopwords file is encoded with a
> different encoding, perhaps KOI8-R? Like UTF-8, KOI8-R includes the
> entirety of 7-bit ASCII, so English words would be properly decoded
> under UTF-8.
>
> Steve
>

--
View this message in context: http://www.nabble.com/Russian-stopwords-tp20851093p20868126.html
Sent from the Solr - User mailing list archive at Nabble.com.
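
The encoding check discussed in this thread can be sketched roughly as follows. This is only an illustration in Python, not Solr code: the helper name describe_encoding and the sample stopwords are hypothetical. It shows why ASCII-only words survive a KOI8-R/UTF-8 mix-up while Russian words do not. It also checks for a leading UTF-8 byte order mark (BOM), which Notepad is known to write when saving as UTF-8. Whether the BOM was the actual cause here is an assumption; the thread does not confirm it.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Classify raw file bytes as 'utf-8-bom', 'utf-8', or 'not-utf-8'."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"


# Hypothetical stopword file contents, one word per line.
russian = "и\nв\nне\n"
english = "the\nand\n"

# ASCII text has identical bytes in KOI8-R and UTF-8,
# so English stopwords decode correctly either way.
same_bytes = english.encode("koi8_r") == english.encode("utf-8")

# Russian text saved as KOI8-R is not valid UTF-8. A lenient decoder
# substitutes U+FFFD, so the words no longer match anything.
koi8_file = russian.encode("koi8_r")
garbled = koi8_file.decode("utf-8", errors="replace")

# A UTF-8 file with a BOM decodes cleanly, but the first word keeps
# an invisible U+FEFF prefix unless the reader strips it.
bom_file = codecs.BOM_UTF8 + russian.encode("utf-8")
first_with_bom = bom_file.decode("utf-8").splitlines()[0]
first_stripped = bom_file.decode("utf-8-sig").splitlines()[0]

print(describe_encoding(koi8_file))
print(describe_encoding(bom_file))
print(repr(first_with_bom), repr(first_stripped))
```

Java's InputStreamReader behaves like the errors="replace" case above: it does not raise on malformed input but substitutes replacement characters, which is consistent with stopwords silently failing to match rather than Solr reporting an error.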