Maria,

It's perfectly reasonable to build a single list, sort it, and scan it
for especially bad cases. See, for example,
http://members.unine.ch/jacques.savoy/clef/index.html for stopword
lists in several languages, or check standard programming modules such
as
http://search.cpan.org/~fabpot/Lingua-StopWords-0.02/lib/Lingua/StopWords.pm
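In case it helps, here is a rough sketch of that approach (plain Java,
no external libraries; the file names and class name are just
placeholders): it merges one plain-text stopword file per language into
a single sorted list and marks words that show up in more than one
list, which is a convenient place to start the manual scan. Whether a
word that is a stopword in one language is a content word in another
still has to be judged by eye.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class MergeStopwords {
    public static void main(String[] args) throws IOException {
        // One plain-text stopword file per language (placeholder names).
        String[] files = { "stopwords_en.txt", "stopwords_de.txt", "stopwords_sv.txt" };

        // word -> set of files whose list contains it
        Map<String, TreeSet<String>> sources = new TreeMap<>();
        for (String file : files) {
            for (String line : Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8)) {
                String word = line.trim().toLowerCase();
                if (!word.isEmpty() && !word.startsWith("#")) {
                    sources.computeIfAbsent(word, w -> new TreeSet<>()).add(file);
                }
            }
        }

        // Print the single sorted list; words present in several lists
        // are marked so they can be reviewed first.
        for (Map.Entry<String, TreeSet<String>> e : sources.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + "    <-- in " + e.getValue());
            } else {
                System.out.println(e.getKey());
            }
        }
    }
}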
On 10/18/07, Maria Mosolova <[EMAIL PROTECTED]> wrote:
>
> Thanks a lot to everyone who responded. Yes, I agree that eventually
> we need to use separate stopword lists for different languages.
> Unfortunately the data we are trying to index at the moment does not
> contain any direct country/language information and we need to create
> the first version of the index quickly. It does not look like
> analyzing documents to determine their language is something which
> could be accomplished in a very limited timeframe. Or am I wrong here
> and there are existing analyzers one could use?
> Maria
>
> On 10/18/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
> > Also "die" in German and English. --wunder
> >
> > On 10/18/07 4:16 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
> >
> > > One example that I'm familiar with: the words "is" and "by" in
> > > English and in Swedish. Both words are stopwords in English, but
> > > they are content words in Swedish (ice and village, respectively).
> > > Similarly, "till" in Swedish is a stopword (to, towards), but it's
> > > a content word in English.