Maria,

It's perfectly reasonable to build a single list, sort it, and scan it for
especially bad cases. See, for example,
http://members.unine.ch/jacques.savoy/clef/index.html for stopword lists in
several languages, or check some standard programming modules such as:
http://search.cpan.org/~fabpot/Lingua-StopWords-0.02/lib/Lingua/StopWords.pm
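
In case it helps, here is a rough Java sketch of that merge-and-scan step.
The file names and language labels are just placeholders for whatever
per-language lists you download; it merges them into one sorted map of
word -> languages and flags words that appear in more than one list, which
are the cases worth reviewing by hand before using a combined stopword file:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Merge per-language stopword files (one word per line) into a single
// sorted list and flag words claimed by more than one language.
public class StopwordMerge {
    public static void main(String[] args) throws IOException {
        // Placeholder file names; substitute the lists you actually use.
        Map<String, String> files = Map.of(
                "english", "stopwords_en.txt",
                "german",  "stopwords_de.txt",
                "swedish", "stopwords_sv.txt");

        // word -> languages whose stopword list contains it (TreeMap keeps
        // the merged list sorted for the manual scan).
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            List<String> lines =
                    Files.readAllLines(Paths.get(e.getValue()), StandardCharsets.UTF_8);
            for (String line : lines) {
                String word = line.trim().toLowerCase();
                if (!word.isEmpty() && !word.startsWith("#")) {
                    merged.computeIfAbsent(word, k -> new TreeSet<>()).add(e.getKey());
                }
            }
        }

        // Print the single sorted list; anything marked below is an
        // "especially bad case" to check by hand.
        for (Map.Entry<String, Set<String>> e : merged.entrySet()) {
            String flag = e.getValue().size() > 1 ? "  <-- in multiple lists" : "";
            System.out.println(e.getKey() + "\t" + e.getValue() + flag);
        }
    }
}

Words it flags (like "die", "is", "by", "till" in the examples quoted below)
are the candidates to drop or handle per-language rather than globally.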

On 10/18/07, Maria Mosolova <[EMAIL PROTECTED]> wrote:
>
> Thanks a lot to everyone who responded. Yes, I agree that eventually
> we need to use separate stopword lists for different languages.
> Unfortunately the data we are trying to index at the moment does not
> contain any direct country/language information and we need to create
> the first version of the index quickly. It does not look like
> analyzing documents to determine their language is something which
> could be accomplished in a very limited timeframe. Or am I wrong here,
> and are there existing analyzers one could use?
> Maria
>
> On 10/18/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
> > Also "die" in German and English. --wunder
> >
> > On 10/18/07 4:16 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
> >
> > > One example that I'm familiar with: words "is" and "by" in English and
> > > in Swedish. Both words are stopwords in English, but they are content
> > > words in Swedish (ice and village, respectively). Similarly, "till" in
> > > Swedish is a stopword (to, towards), but it's a content word in
> > > English.
> >
> >
>