Are you sure they don't just mean they want separate stopword lists
for separate indexes, one per language? Otherwise, I agree, it
doesn't make much sense for a single mixed-language index (unless you
had an intelligent filter that could select a list based on
language).
Maria, perhaps you have specific languages you are looking for? I
would just Google for <Language> stopword list and see what comes
up. There are a lot of multilingual resources out there.
-Grant
On Oct 18, 2007, at 7:16 AM, Andrzej Bialecki wrote:
Lukas Vlcek wrote:
Hi,
I haven't heard of a multilingual stopword list before. What would be
the purpose of it? This seems odd to me :-)
That's because a multilingual stopword list doesn't make sense ;)
One example that I'm familiar with: the words "is" and "by" in English
and in Swedish. Both words are stopwords in English, but they are
content words in Swedish (ice and village, respectively).
Similarly, "till" in Swedish is a stopword (to, towards), but it's
a content word in English.
So, as Lukas correctly suggested, you should first perform language
identification, and then apply the correct stopword list.
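
To make the two-step approach concrete, here is a minimal sketch in
plain Python (not Lucene code). Everything in it is an assumption made
for illustration: the tiny STOPWORDS lists, the helper names, and above
all identify_language(), which is only a crude stand-in that counts
stopword hits. A real system would use a proper language identifier
and full-size lists.

```python
# Tiny illustrative stopword lists (assumed for this sketch; real lists are far longer).
STOPWORDS = {
    "en": {"is", "by", "the", "a", "of", "to"},
    "sv": {"till", "och", "att", "det", "en", "som", "på", "är"},
}


def identify_language(tokens):
    """Crude stand-in for language identification: pick the language
    whose stopword list matches the most tokens (ties go to the first
    language in STOPWORDS)."""
    lowered = [t.lower() for t in tokens]
    scores = {
        lang: sum(t in words for t in lowered)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears (case-insensitively) in stopwords."""
    return [t for t in tokens if t.lower() not in stopwords]


def filter_by_language(tokens):
    """Step 1: identify the language. Step 2: apply only that language's list."""
    return remove_stopwords(tokens, STOPWORDS[identify_language(tokens)])


def filter_with_merged_list(tokens):
    """The problematic alternative: one merged multilingual list."""
    merged = set().union(*STOPWORDS.values())
    return remove_stopwords(tokens, merged)


# Swedish: roughly "There is ice on the road to a village".
swedish = "Det är is på vägen till en by".split()
print(filter_by_language(swedish))       # ['is', 'vägen', 'by']
print(filter_with_merged_list(swedish))  # ['vägen']
```

With the per-language list the Swedish content words "is" and "by"
survive; with the merged list they are thrown away along with the real
stopwords.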
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007. Sign up now!
http://www.apachecon.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ