Are you sure they don't just mean they want separate stopword lists for separate indexes in different languages? Otherwise, I agree, it doesn't make much sense for a single mixed-language index (unless you had an intelligent filter that could select the list based on each document's language).

Maria, perhaps you have specific languages you are looking for? I would just Google for "<Language> stopword list" and see what comes up. There are a lot of multilingual resources out there.

-Grant

On Oct 18, 2007, at 7:16 AM, Andrzej Bialecki wrote:

Lukas Vlcek wrote:
Hi,
I haven't heard of a multilingual stopword list before. What would be the
purpose of it? This seems odd to me :-)

That's because a multilingual stopword list doesn't make sense ;)

One example that I'm familiar with: the words "is" and "by" in English and in Swedish. Both are stopwords in English, but they are content words in Swedish ("ice" and "village", respectively). Similarly, "till" is a stopword in Swedish ("to", "towards"), but it's a content word in English.

So, as Lukas correctly suggested, you should first perform language identification, and then apply the correct stopword list.
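
To make the point concrete, here is a rough Python sketch (not Lucene code). The tiny stopword sets, the function names, and the two-letter language codes are all made up for illustration, and it assumes a separate language identification step has already produced the language code. It compares filtering with the list for the identified language against filtering with one merged list.

```python
# Tiny illustrative stopword lists (assumption: real lists are much longer).
STOPWORDS = {
    "en": {"is", "by", "the", "a", "of"},
    "sv": {"till", "och", "att", "som"},
}


def remove_stopwords(tokens, lang):
    """Drop stopwords using only the list for the identified language."""
    stop = STOPWORDS.get(lang, set())  # unknown language: filter nothing
    return [t for t in tokens if t.lower() not in stop]


def remove_stopwords_merged(tokens):
    """Drop stopwords using the union of all lists (a 'multilingual' list)."""
    merged = set().union(*STOPWORDS.values())
    return [t for t in tokens if t.lower() not in merged]


# Swedish tokens: "by" (village), "och" (and), "is" (ice).
swedish = ["by", "och", "is"]
print(remove_stopwords(swedish, "sv"))   # keeps the content words "by" and "is"
print(remove_stopwords_merged(swedish))  # drops them along with "och"

# English tokens: "till" is a content word here.
english = ["till", "the", "soil"]
print(remove_stopwords(english, "en"))   # keeps "till" and "soil"
print(remove_stopwords_merged(english))  # loses "till"
```

With the merged list, the Swedish content words "by" and "is" disappear because they are English stopwords, and the English "till" disappears because it is a Swedish stopword.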


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

