Re: multilingual list of stopwords

Maria Mosolova Thu, 18 Oct 2007 08:19:17 -0700

Thanks a lot Peter!
Maria


On 10/18/07, Binkley, Peter <[EMAIL PROTECTED]> wrote:
> There's code in Nutch to identify the language of a given text:
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/lang/LanguageIdentifier.html .
>
> Peter
>
> -----Original Message-----
> From: Maria Mosolova [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 18, 2007 8:48 AM
> To: solr-user@lucene.apache.org
> Subject: Re: multilingual list of stopwords
>
> Thanks a lot to everyone who responded. Yes, I agree that eventually we
> need to use separate stopword lists for different languages.
> Unfortunately the data we are trying to index at the moment does not
> contain any direct country/language information and we need to create
> the first version of the index quickly. It does not look like analyzing
> documents to determine their language is something which could be
> accomplished in a very limited timeframe. Or am I wrong here and there
> are existing analyzers one could use?
> Maria
>
> On 10/18/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
> > Also "die" in German and English. --wunder
> >
> > On 10/18/07 4:16 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
> >
> > > One example that I'm familiar with: words "is" and "by" in English
> > > and in Swedish. Both words are stopwords in English, but they are
> > > content words in Swedish (ice and village, respectively). Similarly,
> > > > "till" in Swedish is a stopword (to, towards), but it's a content
> > > > word in English.
> >
> >
>
>
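
For readers of the archive, the collisions mentioned in this thread ("is", "by", "till") can be reproduced with a rough Python sketch. The stopword sets below are tiny illustrative samples built mostly from the words quoted above, not real stopword lists, and the function names are hypothetical. The sketch only compares filtering with a per-language list against filtering with one merged multilingual list.

```python
# Tiny illustrative samples, NOT real stopword lists (assumption for this sketch).
STOPWORDS = {
    "en": {"is", "by", "the"},
    "sv": {"till", "och"},
    "de": {"die", "der", "und"},
}


def remove_stopwords(tokens, lang):
    """Drop tokens found in the stopword list of one known language."""
    stop = STOPWORDS[lang]
    return [t for t in tokens if t.lower() not in stop]


def remove_merged(tokens):
    """Drop tokens found in the union of all lists (language unknown)."""
    merged = set().union(*STOPWORDS.values())
    return [t for t in tokens if t.lower() not in merged]


swedish = ["is", "och", "by"]      # "ice and village"
english = ["till", "the", "soil"]

print(remove_stopwords(swedish, "sv"))  # ['is', 'by']
print(remove_merged(swedish))           # []
print(remove_stopwords(english, "en"))  # ['till', 'soil']
print(remove_merged(english))           # ['soil']
```

With the merged list, the Swedish content words "is" and "by" and the English content word "till" are all dropped, which is the loss a per-language list avoids once the document language is known.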
