Re: multilingual list of stopwords

Grant Ingersoll Thu, 18 Oct 2007 05:53:07 -0700

Are you sure they don't just mean they want separate stopword lists for various different indexes in different languages? Otherwise, I agree, it doesn't make much sense for a single mixed-language index (unless you had an intelligent filter that could select based on language).

Maria, perhaps you have specific languages you are looking for? I would just Google for "<Language> stopword list" and see what comes up. There are a lot of multilingual resources out there.


-Grant

On Oct 18, 2007, at 7:16 AM, Andrzej Bialecki wrote:

Lukas Vlcek wrote:
Hi,
I haven't heard of a multilingual stopword list before. What should be the
purpose of it? This seems too odd to me :-)
That's because a multilingual stopword list doesn't make sense ;)
One example that I'm familiar with: the words "is" and "by" in English and in Swedish. Both words are stopwords in English, but they are content words in Swedish (ice and village, respectively). Similarly, "till" in Swedish is a stopword (to, towards), but it's a content word in English.
So, as Lukas correctly suggested, you should first perform language identification, and then apply the correct stopword list.
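
A minimal sketch of that two-step idea, assuming the language has already been identified upstream and is passed in as a code such as "en" or "sv". The `STOPWORDS` table and the `remove_stopwords` helper are hypothetical names used only for illustration, and the word lists are tiny samples rather than real stopword lists. This is plain Python, not Lucene's API.

```python
# Tiny illustrative stopword lists keyed by language code.
# Real lists would be much longer; these are assumptions for the example.
STOPWORDS = {
    "en": {"is", "by", "the", "a", "of"},
    "sv": {"till", "och", "att", "en", "på"},
}


def remove_stopwords(tokens, lang):
    """Drop stopwords using the list for the already-identified language."""
    # Unknown language: fall back to an empty list, so nothing is filtered.
    stop = STOPWORDS.get(lang, set())
    return [t for t in tokens if t.lower() not in stop]


# "is" and "by" are dropped from English text...
english = remove_stopwords(["the", "village", "is", "by", "the", "lake"], "en")

# ...but kept in Swedish text, where they mean "ice" and "village".
swedish = remove_stopwords(["is", "och", "by"], "sv")
```

With a single merged list, "is", "by", and "till" would be removed from both languages, which is exactly the problem described above.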
--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:

ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
