The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:

> The WordNet project at Princeton (USA) is a large database of synonyms.
> If you're only working in English this might be useful instead of
> running your own analyses.
>
> http://en.wikipedia.org/wiki/WordNet
> http://wordnet.princeton.edu/
>
> Lance
>
> -----Original Message-----
> From: Eswar K [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 26, 2007 6:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> In addition to recording which keywords a document contains, the method
> examines the document collection as a whole, to see which other
> documents contain some of those same words. The algorithm should
> consider documents that have many words in common to be semantically
> close, and ones with few words in common to be semantically distant.
> This simple method correlates surprisingly well with how a human being,
> looking at content, might classify a document collection. Although the
> algorithm doesn't understand anything about what the words *mean*, the
> patterns it notices can make it seem astonishingly intelligent.
>
> When you search such an index, the search engine looks at similarity
> values it has calculated for every content word, and returns the
> documents that it thinks best fit the query. Because two documents may
> be semantically very close even if they do not share a particular
> keyword, this algorithm will often return relevant documents that
> don't contain the keyword at all, where a plain keyword search would
> fail for lack of an exact match.
>
> - Eswar
>
> On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
> > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> >
> > > We essentially are looking at having an implementation for doing
> > > search which can return documents having conceptually similar words
> > > without necessarily having the original word searched for.
> >
> > Very challenging. Say someone searches for "LSA" and hits an archived
> > version of the mail you sent to this list. "LSA" is a reasonably
> > discriminating term. But so is "Eswar".
> >
> > If you knew that the original term was "LSA", then you might look for
> > documents near it in term vector space. But if you don't know the
> > original term, only the content of the document, how do you know
> > whether you should look for docs near "lsa" or "eswar"?
> >
> > Marvin Humphrey
> > Rectangular Research
> > http://www.rectangular.com/
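
For readers of the archive, the idea described in the quoted "LSA
Implementation" message can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not anything Solr or Lucene
ships: the toy corpus, the function names, and the choice of k=2 latent
dimensions are all made up for the example, and the truncated SVD step is
approximated with plain power iteration on the term-by-term matrix A*A^T so
that no external library is needed. Documents and the query are projected
onto the top-k latent directions and compared by cosine similarity.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for word, count in Counter(doc.split()).items():
            matrix[row[word]][j] = float(count)
    return vocab, matrix


def top_eigenvectors(sym, k, iterations=300):
    """Top-k eigenvectors of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(sym)
    work = [r[:] for r in sym]  # copy, because deflation modifies the matrix
    vectors = []
    for _component in range(k):
        v = [1.0] * n
        for _step in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # nothing left to extract (k exceeds the rank)
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue that belongs to v.
        eigenvalue = sum(
            v[i] * sum(work[i][j] * v[j] for j in range(n)) for i in range(n)
        )
        vectors.append(v)
        # Deflate: remove the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return vectors


def project(vector, basis):
    """Coordinates of a term-space vector in the latent basis."""
    return [sum(a * b for a, b in zip(vector, u)) for u in basis]


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def lsa_similarities(docs, query, k=2):
    """Cosine similarity between the query and each document in a k-dim latent space."""
    vocab, a = term_document_matrix(docs)
    n_terms, n_docs = len(vocab), len(docs)
    # Term-by-term matrix A * A^T; its top eigenvectors are the left singular vectors of A.
    gram = [
        [sum(a[i][d] * a[j][d] for d in range(n_docs)) for j in range(n_terms)]
        for i in range(n_terms)
    ]
    basis = top_eigenvectors(gram, k)
    counts = Counter(query.split())
    q_latent = project([float(counts[w]) for w in vocab], basis)  # unknown words are ignored
    return [
        cosine(q_latent, project([a[i][d] for i in range(n_terms)], basis))
        for d in range(n_docs)
    ]


# Hypothetical toy corpus: two "topics" that share no vocabulary.
docs = [
    "car engine",
    "automobile engine",
    "banana fruit",
    "apple fruit",
    "cherry fruit",
]
scores = lsa_similarities(docs, "car", k=2)
for doc, score in zip(docs, scores):
    print(f"{score:.2f}  {doc}")
```

With this corpus, "automobile engine" should score close to 1 for the query
"car" even though it never contains that word, because it shares "engine"
with the one document that does; the fruit documents should score close to
0. A real index would need term weighting (for example tf-idf), a proper
sparse SVD, and a much larger k. The sketch also does nothing to answer
Marvin's question about which term of a document to search near.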