Lance,

It does cover European languages, but pretty much nothing on Asian languages (CJK).

- Eswar

On Nov 28, 2007 1:51 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:

> WordNet itself is English-only. There are various ontology projects for
> it.
>
> http://www.globalwordnet.org/ is a separate world language database
> project. I found it at the bottom of the WordNet Wikipedia page. Thanks
> for starting me on the search!
>
> Lance
>
> -----Original Message-----
> From: Eswar K [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 26, 2007 6:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> The languages also include CJK :) among others.
>
> - Eswar
>
> On Nov 27, 2007 8:16 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
>
> > The WordNet project at Princeton (USA) is a large database of
> > synonyms. If you're only working in English this might be useful
> > instead of running your own analyses.
> >
> > http://en.wikipedia.org/wiki/WordNet
> > http://wordnet.princeton.edu/
> >
> > Lance
> >
> > -----Original Message-----
> > From: Eswar K [mailto:[EMAIL PROTECTED]
> > Sent: Monday, November 26, 2007 6:34 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: LSA Implementation
> >
> > In addition to recording which keywords a document contains, the
> > method examines the document collection as a whole, to see which
> > other documents contain some of those same words. This algo should
> > consider documents that have many words in common to be semantically
> > close, and ones with few words in common to be semantically distant.
> > This simple method correlates surprisingly well with how a human
> > being, looking at content, might classify a document collection.
> > Although the algorithm doesn't understand anything about what the
> > words *mean*, the patterns it notices can make it seem astonishingly
> > intelligent.
> >
> > When you search such an index, the search engine looks at similarity
> > values it has calculated for every content word, and returns the
> > documents that it thinks best fit the query. Because two documents
> > may be semantically very close even if they do not share a
> > particular keyword, where a plain keyword search will fail if there
> > is no exact match, this algo will often return relevant documents
> > that don't contain the keyword at all.
> >
> > - Eswar
> >
> > On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> >
> > > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> > >
> > > > We essentially are looking at having an implementation for doing
> > > > search which can return documents having conceptually similar
> > > > words without necessarily having the original word searched for.
> > >
> > > Very challenging. Say someone searches for "LSA" and hits an
> > > archived version of the mail you sent to this list. "LSA" is a
> > > reasonably discriminating term. But so is "Eswar".
> > >
> > > If you knew that the original term was "LSA", then you might look
> > > for documents near it in term vector space. But if you don't know
> > > the original term, only the content of the document, how do you
> > > know whether you should look for docs near "lsa" or "eswar"?
> > >
> > > Marvin Humphrey
> > > Rectangular Research
> > > http://www.rectangular.com/
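
For anyone who wants to experiment with the idea described in the quoted text above, here is a rough sketch of LSA-style matching in Python. It is not Solr or Lucene code and nothing here comes from this thread's participants: the NumPy dependency, the toy corpus `DOCS`, the helper names `build_term_doc_matrix` and `lsa_similarities`, raw term counts (no tf-idf weighting) and the choice of k=2 latent dimensions are all assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical toy corpus: three vehicle documents, two fruit documents.
DOCS = [
    "car engine repair",
    "automobile engine repair",   # never mentions "car"
    "car automobile dealer",
    "banana fruit smoothie",
    "fruit banana recipe",
]


def build_term_doc_matrix(docs):
    """Return a (terms x documents) raw count matrix and a term -> row index map."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.split():
            matrix[index[word], j] += 1.0
    return matrix, index


def lsa_similarities(docs, query, k=2):
    """Cosine similarity between the query and each document in a k-dim latent space."""
    matrix, index = build_term_doc_matrix(docs)

    # Truncated SVD: keep only the k largest singular values and their vectors.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    u_k, s_k, docs_k = u[:, :k], s[:k], vt[:k, :].T   # docs_k: one row per document

    # Fold the query into the same latent space as the documents.
    q = np.zeros(matrix.shape[0])
    for word in query.split():
        if word in index:
            q[index[word]] += 1.0
    q_k = (q @ u_k) / s_k

    # Cosine similarity, guarding against zero-length vectors.
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return (docs_k @ q_k) / np.maximum(norms, 1e-12)


if __name__ == "__main__":
    for doc, score in zip(DOCS, lsa_similarities(DOCS, "car")):
        print(f"{score:+.3f}  {doc}")
```

On this toy data a query for "car" scores "automobile engine repair" about as high as the documents that actually contain "car", while the fruit documents score near zero, which is the behaviour a plain keyword match cannot give. A real collection would need weighting, a much larger k, and a sparse SVD routine; this only shows the shape of the computation.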