Re: LSA Implementation

Grant Ingersoll Tue, 27 Nov 2007 17:13:37 -0800

Using WordNet may require some type of disambiguation approach; otherwise you can end up w/ a lot of "synonyms". I would also look into how much coverage there is for non-English languages.

If you have the resources, you may be better off developing/finding your own synonym/concept list based on your genres. You may also look into other approaches for assigning concepts offline and adding them to the document.
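
As a rough illustration of assigning concepts offline, the sketch below tags a document with concept labels from a hand-built list before it is indexed. The CONCEPTS map, the add_concepts helper, and the "concepts" field name are all hypothetical, made up for this example; they are not part of Solr or WordNet.

```python
# Hypothetical, hand-built concept list for one genre (not from WordNet).
CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "laptop": "computer",
    "notebook": "computer",
}

def add_concepts(text, concepts=CONCEPTS):
    """Return the sorted, de-duplicated concept labels for words in `text`."""
    words = text.lower().split()
    return sorted({concepts[w] for w in words if w in concepts})

# Enrich the document offline; the extra field would then be indexed
# alongside the original text.
doc = {"id": "1", "text": "Used automobile and truck listings"}
doc["concepts"] = add_concepts(doc["text"])
```

A query for the concept label could then match this document even though the label never appears in its text. Note that a flat map like this does no disambiguation, so a word such as "notebook" is always tagged the same way regardless of sense.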


-Grant

On Nov 27, 2007, at 3:21 PM, Norskog, Lance wrote:

WordNet itself is English-only. There are various ontology projects for it.

http://www.globalwordnet.org/ is a separate world language database project. I found it at the bottom of the WordNet Wikipedia page. Thanks for starting me on the search!

Lance

-----Original Message-----
From: Eswar K [mailto:[EMAIL PROTECTED]
Sent: Monday, November 26, 2007 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:

The WordNet project at Princeton (USA) is a large database of synonyms. If you're only working in English, this might be useful instead of running your own analyses.

http://en.wikipedia.org/wiki/WordNet
http://wordnet.princeton.edu/

Lance

-----Original Message-----
From: Eswar K [mailto:[EMAIL PROTECTED]
Sent: Monday, November 26, 2007 6:34 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

In addition to recording which keywords a document contains, the
method examines the document collection as a whole to see which other
documents contain some of those same words. This algo should consider
documents that have many words in common to be semantically close, and
ones with few words in common to be semantically distant. This simple
method correlates surprisingly well with how a human being, looking at
content, might classify a document collection. Although the algorithm
doesn't understand anything about what the words *mean*, the patterns
it notices can make it seem astonishingly intelligent.
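
The "many words in common" idea above can be sketched as cosine similarity between term-count vectors. This is only the word-overlap step under simplifying assumptions (whitespace tokenization, raw counts, made-up sample documents); a full LSA implementation would additionally reduce the term-document matrix with an SVD, which is not shown here.

```python
import math
from collections import Counter

def term_counts(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up sample collection for illustration.
docs = {
    "d1": "solr search index query ranking",
    "d2": "lucene search index query scoring",
    "d3": "recipe flour sugar butter oven",
}
vectors = {name: term_counts(text) for name, text in docs.items()}

# d1 and d2 share three of five words; d1 and d3 share none.
sim_close = cosine_similarity(vectors["d1"], vectors["d2"])
sim_far = cosine_similarity(vectors["d1"], vectors["d3"])
```

Here sim_close is larger than sim_far, so d1 and d2 would be treated as semantically close and d1 and d3 as distant.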

When you search such an index, the search engine looks at the
similarity values it has calculated for every content word and
returns the documents that it thinks best fit the query. Two
documents may be semantically very close even if they do not share a
particular keyword, so where a plain keyword search will fail if
there is no exact match, this algo will often return relevant
documents that don't contain the keyword at all.
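
To make the contrast with plain keyword search concrete, here is a minimal sketch that returns exact keyword matches plus documents that are similar to those matches. It is a simpler stand-in, not the SVD-based similarity that LSA computes; the search function, the 0.3 threshold, and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter

def vec(text):
    """Term-count vector from whitespace-split, lowercased text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(keyword, docs, threshold=0.3):
    """Return (exact matches, documents similar to an exact match)."""
    vectors = {name: vec(text) for name, text in docs.items()}
    exact = [n for n, v in vectors.items() if keyword in v]
    related = []
    for name, v in vectors.items():
        if name in exact:
            continue
        # Best similarity to any document that contains the keyword.
        best = max((cosine(v, vectors[m]) for m in exact), default=0.0)
        if best >= threshold:
            related.append(name)
    return exact, related

# Made-up sample collection for illustration.
docs = {
    "a": "car engine repair manual",
    "b": "automobile engine repair guide",
    "c": "chocolate cake baking guide",
}
exact, related = search("car", docs)
```

Document "b" never mentions "car", but it shares enough words with the matching document "a" to be returned as related, while "c" is not.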

- Eswar

On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:


On Nov 26, 2007, at 6:06 PM, Eswar K wrote:

We essentially are looking at having an implementation for doing
search which can return documents having conceptually similar
words without necessarily having the original word searched for.


Very challenging.  Say someone searches for "LSA" and hits an
archived version of the mail you sent to this list.  "LSA" is a
reasonably discriminating term.  But so is "Eswar".

If you knew that the original term was "LSA", then you might look
for documents near it in term vector space.  But if you don't know
the original term, only the content of the document, how do you know
whether you should look for docs near "lsa" or "eswar"?
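
A small sketch of the ambiguity described above: if each term is represented by the set of documents it occurs in, the neighbors of "lsa" and the neighbors of "eswar" point in different directions. The term_vectors and nearest_terms helpers and the four sample documents are hypothetical, and binary occurrence sets are a simplification of a real term vector space.

```python
import math
from collections import defaultdict

def term_vectors(docs):
    """Map each term to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings[term].add(doc_id)
    return postings

def nearest_terms(term, postings, k=3):
    """Rank other terms by cosine similarity of their occurrence sets."""
    target = postings.get(term, set())
    if not target:
        return []
    scored = []
    for other, ids in postings.items():
        if other == term:
            continue
        score = len(target & ids) / math.sqrt(len(target) * len(ids))
        if score > 0:
            scored.append((score, other))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]

# Made-up sample collection for illustration.
docs = {
    1: "lsa svd semantic indexing",
    2: "lsa svd matrix",
    3: "eswar solr mailing list",
    4: "eswar lsa solr question",
}
postings = term_vectors(docs)
near_lsa = nearest_terms("lsa", postings)
near_eswar = nearest_terms("eswar", postings)
```

Starting from "lsa" the closest term is "svd"; starting from "eswar" it is "solr". Document 4 contains both terms, so its content alone does not say which neighborhood to explore.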

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
