The WordNet project at Princeton (USA) is a large lexical database of
English words grouped into sets of synonyms.
If you're only working in English this might be useful instead of
running your own analyses.

http://en.wikipedia.org/wiki/WordNet
http://wordnet.princeton.edu/

Lance

-----Original Message-----
From: Eswar K [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 26, 2007 6:34 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

In addition to recording which keywords a document contains, the method
examines the document collection as a whole, to see which other
documents contain some of those same words. The algorithm considers
documents that have many words in common to be semantically close, and
ones with few words in common to be semantically distant. This simple
method correlates surprisingly well with how a human being, looking at
content, might classify a document collection. Although the algorithm
doesn't understand anything about what the words *mean*, the patterns it
notices can make it seem astonishingly intelligent.
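
To make the "many words in common" idea concrete, here is a minimal
sketch in Python. It only compares raw word counts with cosine
similarity; it is not LSA itself, which would also factor the
term-document matrix. The three toy documents and the helper names
(term_vector, cosine) are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection, invented for illustration only.
docs = {
    "d1": "solr search index query",
    "d2": "lucene search index ranking",
    "d3": "banana bread recipe flour",
}

def term_vector(text):
    """Count how often each word occurs in the text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = {name: term_vector(text) for name, text in docs.items()}

# d1 and d2 share two of their four words; d1 and d3 share none.
sim_12 = cosine(vectors["d1"], vectors["d2"])
sim_13 = cosine(vectors["d1"], vectors["d3"])
print("d1 vs d2:", sim_12)
print("d1 vs d3:", sim_13)
```

Documents that share more words score closer to 1, and documents with
no words in common score 0.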

When you search such an index, the search engine looks at similarity
values it has calculated for every content word, and returns the
documents that it thinks best fit the query. Because two documents may
be semantically very close even if they do not share a particular
keyword, the method does not require an exact match to return useful
results.

Where a plain keyword search will fail if there is no exact match, this
algorithm will often return relevant documents that don't contain the
keyword at all.
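
A rough Python sketch of that last point follows. It is a much cruder
stand-in for LSA: instead of a singular value decomposition it simply
scores documents by how many of their words co-occur with the query
word somewhere in the collection. The toy documents and the function
names (keyword_search, expanded_search) are assumptions made up for
this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy collection, invented for illustration only.
docs = {
    "d1": "car engine repair",
    "d2": "automobile engine repair",
    "d3": "banana bread recipe",
}
tokens = {name: text.lower().split() for name, text in docs.items()}

# cooc[t][u] = number of documents that contain both word t and word u.
cooc = defaultdict(Counter)
for words in tokens.values():
    unique = set(words)
    for t in unique:
        for u in unique:
            cooc[t][u] += 1

def keyword_search(query):
    """Plain keyword search: only documents that contain the exact word."""
    return sorted(name for name, words in tokens.items() if query in words)

def expanded_search(query):
    """Rank documents by how strongly their words co-occur with the query."""
    related = cooc.get(query, Counter())
    scores = {}
    for name, words in tokens.items():
        score = sum(related[w] for w in set(words))
        if score > 0:
            scores[name] = score
    # Highest score first; ties broken by document name.
    return sorted(scores, key=lambda n: (-scores[n], n))

print(keyword_search("car"))
print(expanded_search("car"))
```

Here the plain search for "car" finds only d1, while the expanded
search also returns d2, which never mentions "car" but shares "engine"
and "repair" with a document that does.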

- Eswar

On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:

>
> On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
>
> > We essentially are looking at having an implementation for doing 
> > search which can return documents having conceptually similar words 
> > without necessarily having the original word searched for.
>
> Very challenging.  Say someone searches for "LSA" and hits an archived
> version of the mail you sent to this list.  "LSA" is a reasonably
> discriminating term.  But so is "Eswar".
>
> If you knew that the original term was "LSA", then you might look for 
> documents near it in term vector space.  But if you don't know the 
> original term, only the content of the document, how do you know 
> whether you should look for docs near "lsa" or "eswar"?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
