The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:

> The WordNet project at Princeton (USA) is a large database of synonyms.
> If you're only working in English this might be useful instead of
> running your own analyses.
>
> http://en.wikipedia.org/wiki/WordNet
> http://wordnet.princeton.edu/
>
> Lance
>
> -----Original Message-----
> From: Eswar K [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 26, 2007 6:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> In addition to recording which keywords a document contains, the method
> examines the document collection as a whole, to see which other
> documents contain some of those same words. The algorithm considers
> documents that have many words in common to be semantically close, and
> ones with few words in common to be semantically distant. This simple
> method correlates surprisingly well with how a human being, looking at
> content, might classify a document collection. Although the algorithm
> doesn't understand anything about what the words *mean*, the patterns it
> notices can make it seem astonishingly intelligent.
>
> When you search such an index, the search engine looks at the similarity
> values it has calculated for every content word, and returns the
> documents that it thinks best fit the query. Because two documents may
> be semantically very close even if they do not share a particular
> keyword, this algorithm will often return relevant documents that don't
> contain the keyword at all, where a plain keyword search would fail for
> lack of an exact match.
>
> - Eswar
>
> On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
> >
> > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> >
> > > We essentially are looking at having an implementation for doing
> > > search which can return documents having conceptually similar words
> > > without necessarily having the original word searched for.
> >
> > Very challenging.  Say someone searches for "LSA" and hits an archived
> > version of the mail you sent to this list.  "LSA" is a reasonably
> > discriminating term.  But so is "Eswar".
> >
> > If you knew that the original term was "LSA", then you might look for
> > documents near it in term vector space.  But if you don't know the
> > original term, only the content of the document, how do you know
> > whether you should look for docs near "lsa" or "eswar"?
> >
> > Marvin Humphrey
> > Rectangular Research
> > http://www.rectangular.com/
> >
> >
> >
>
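
For reference, the method described in the quoted Nov 26 message (judging
documents by the words they share across the whole collection rather than
by exact keyword match) can be sketched roughly as below. This is only a
minimal illustration, not anything from Solr or Lucene: it assumes NumPy,
raw term counts, and a truncated SVD with k=2, which is the usual textbook
formulation of LSA. The corpus and the names DOCS, term_document_matrix
and lsa_scores are made up for the example.

```python
import numpy as np

# Toy corpus (made up for illustration). Document 2 never uses the word "car".
DOCS = [
    "car engine repair",
    "automobile engine repair",
    "automobile dealer",
    "banana bread recipe",
    "banana smoothie recipe",
]


def term_document_matrix(docs):
    """Return (A, index): raw term counts, one row per term, one column per doc."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: row for row, word in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for col, doc in enumerate(docs):
        for word in doc.split():
            A[index[word], col] += 1.0
    return A, index


def lsa_scores(query, docs, k=2):
    """Cosine similarity between the query and each doc in a k-dim latent space."""
    A, index = term_document_matrix(docs)

    # Truncated SVD: keep only the k strongest co-occurrence patterns.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]

    # Project every document (a column of A) into the latent space.
    doc_vecs = (U_k.T @ A).T  # shape: (n_docs, k)

    # Build the query as a term-count vector and project it the same way.
    q = np.zeros(A.shape[0])
    for word in query.split():
        if word in index:  # words outside the vocabulary are ignored
            q[index[word]] += 1.0
    q_vec = U_k.T @ q

    # Cosine similarity; the floor avoids dividing by zero for empty queries.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    return doc_vecs @ q_vec / np.maximum(norms, 1e-12)


if __name__ == "__main__":
    for doc, score in zip(DOCS, lsa_scores("car", DOCS)):
        print(f"{score:6.3f}  {doc}")
```

On this toy corpus a query for "car" scores "automobile dealer" about as
high as the documents that actually contain "car", because "automobile"
co-occurs with "engine" and "repair"; the banana documents score near
zero. A real collection would need term weighting (e.g. tf-idf) and a much
larger k, and the dense SVD used here would not scale to a full index.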
