Hi,

I haven't heard of a multilingual stop words list before. What would be the
purpose of it? It seems too odd to me :-)
Stop words are used to cut down the size of the index.

One way you can go about this is to create your own list by indexing your
documents (without removing stop words), then looking at the most frequent
terms and picking some of them for the list. This can work if you are
indexing a static set of documents (so you know what your content is all
about and you can drop some words without losing any important
information).
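To make the idea concrete, here is a minimal, illustrative sketch (plain
Python, not the Solr/Lucene API) of pulling stop word candidates out of a
collection by raw term frequency; the sample documents and the top_n cutoff
are made up for the example:

```python
# Illustrative sketch: derive stop word candidates from term
# frequencies across a small document collection.
from collections import Counter

def stopword_candidates(documents, top_n=10):
    """Return the top_n most frequent tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(top_n)]

docs = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog barked at the cat and the mat",
]
print(stopword_candidates(docs, top_n=3))  # "the" dominates this toy corpus
```

In practice you would run this over the terms already stored in your index
rather than re-tokenizing raw text, and you would review the candidates by
hand before using them.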

But I think the preferred way is to identify the language first and then
use a language-specific stop list.
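As a toy sketch of that identify-then-filter flow (the two tiny stop lists
and the overlap-based "detector" below are assumptions for illustration; a
real system would use a proper language identification library):

```python
# Toy sketch: guess the language by stop word overlap, then apply
# that language's stop list. Not a real language detector.
STOP_LISTS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "und", "ist", "von", "zu"},
}

def guess_language(tokens):
    # Pick the language whose stop list overlaps the tokens the most.
    return max(STOP_LISTS, key=lambda lang: len(STOP_LISTS[lang] & set(tokens)))

def remove_stopwords(text):
    tokens = text.lower().split()
    lang = guess_language(tokens)
    return lang, [t for t in tokens if t not in STOP_LISTS[lang]]

print(remove_stopwords("the house of the king is old"))
```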

If you can't use language identification then you can try more creative
ways, for example:
Employ some kind of document classification algorithm and create a stop
list for each class. Then for every new document you first determine which
class it belongs to and apply that class's stop list.
I am just sucking the wind here...
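The per-class idea above could look roughly like this (everything here is
hypothetical: the classes, the stop lists, and the trivial keyword-hit
"classifier" standing in for a real classification algorithm):

```python
# Hypothetical sketch: per-class stop lists, with a trivial
# keyword-based stand-in for a real document classifier.
CLASS_STOP_LISTS = {
    "sports": {"match", "score", "team"},
    "finance": {"market", "stock", "price"},
}
CLASS_KEYWORDS = {
    "sports": {"goal", "league"},
    "finance": {"bank", "shares"},
}

def classify(tokens):
    # Stand-in classifier: count keyword hits per class.
    return max(CLASS_KEYWORDS,
               key=lambda c: len(CLASS_KEYWORDS[c] & set(tokens)))

def apply_class_stoplist(text):
    tokens = text.lower().split()
    cls = classify(tokens)
    return cls, [t for t in tokens if t not in CLASS_STOP_LISTS[cls]]

print(apply_class_stoplist("the league match score surprised the team"))
```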

Regards,
Lukas

On 10/18/07, Joseph Doehr <[EMAIL PROTECTED]> wrote:
>
>
>         Hi Maria,
>
> this is a "me too". ;)
> At the moment I'll take the approach of merging the various language
> stopword files I need into one and using that. But the main problem in
> this case is collisions: words which are stopwords in one language but
> not in another.
>
>         Cheers,
>         Joe
>
>
> Maria Mosolova schrieb:
> > I am looking for a multilingual list of stopwords to use with
> > Solr/Lucene and would greatly appreciate an advice on where I could
> > find it.
>
>


-- 
http://blog.lukas-vlcek.com/
