Thanks a lot Peter!
Maria
On 10/18/07, Binkley, Peter <[EMAIL PROTECTED]> wrote:
> There's code in Nutch to identify the language of a given text:
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/lang/LanguageIdentifier.html .
>
> Peter
>
> -----Original Message-----
> From: Maria Mosolova [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 18, 2007 8:48 AM
> To: solr-user@lucene.apache.org
> Subject: Re: multilingual list of stopwords
>
> Thanks a lot to everyone who responded. Yes, I agree that eventually we
> need to use separate stopword lists for different languages.
> Unfortunately the data we are trying to index at the moment does not
> contain any direct country/language information and we need to create
> the first version of the index quickly. It does not look like analyzing
> documents to determine their language is something which could be
> accomplished in a very limited timeframe. Or am I wrong here and there
> are existing analyzers one could use?
> Maria
>
> On 10/18/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
> > Also "die" in German and English. --wunder
> >
> > On 10/18/07 4:16 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
> >
> > > One example that I'm familiar with: words "is" and "by" in English
> > > and in Swedish. Both words are stopwords in English, but they are
> > > content words in Swedish (ice and village, respectively). Similarly,
> > > "till" in Swedish is a stopword (to, towards), but it's a content
> > > word in English.