Hi Peyman, I never saw this mentioned on the Lucene/Solr mailing lists, so if anyone has done any work on this, I don't think it was shared.
Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

>________________________________
> From: Peyman Faratin <pey...@robustlinks.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, April 23, 2012 12:29 PM
> Subject: Kernel methods in SOLR
>
> Hi
>
> Has there been any work that tries to integrate kernel methods [1] with SOLR?
> I am interested in using kernel methods to solve the synonym, hyponym and
> polysemy (disambiguation) problems which SOLR's vector space model ("bag of
> words") does not capture.
>
> For example, imagine we have only 3 words in our corpus: "puma", "cougar" and
> "feline". The 3 words obviously have interdependencies (puma disambiguates to
> cougar; cougar and puma are instances of felines - hyponyms). Now, imagine 2
> docs, d1 and d2, that have the following TF-IDF vectors:
>
>        puma, cougar, feline
> d1 = [  2,     0,      0  ]
> d2 = [  0,     1,      0  ]
>
> i.e. d1 has no mention of the terms cougar or feline and, conversely, d2 has
> no mention of the terms puma or feline. Hence, under the vector approach, d1
> and d2 are not related at all (and each interpretation of the terms has a
> unique vector), which is not what we want to conclude.
>
> What I need is to include a kernel matrix (as data), such as the following,
> that captures these relationships:
>
>            puma, cougar, feline
> puma   = [  1,     1,    0.4 ]
> cougar = [  1,     1,    0.4 ]
> feline = [ 0.4,   0.4,    1  ]
>
> then recompute each TF-IDF vector as the product of (1) the original vector
> and (2) the kernel matrix, resulting in
>
>        puma, cougar, feline
> d1 = [  2,     2,    0.8 ]
> d2 = [  1,     1,    0.4 ]
>
> (note, the new vectors are much less sparse).
>
> I can solve this problem (inefficiently) at the application layer, but I was
> wondering whether there have been any attempts within the community to solve
> similar problems efficiently, without paying a hefty response-time price?
>
> thank you
>
> Peyman
>
> [1] http://en.wikipedia.org/wiki/Kernel_methods
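For anyone reading the archive: the vector-times-kernel-matrix step Peyman describes is just a plain matrix product, and his numbers check out. A minimal stdlib-only Python sketch (not Solr code; `apply_kernel` and the variable names are my own, only the vocabulary order [puma, cougar, feline] and the values come from the message above):

```python
def apply_kernel(doc_vec, kernel):
    """Right-multiply a document row vector by the term-term kernel matrix.

    doc_vec: list of TF-IDF weights, one per vocabulary term.
    kernel:  square list-of-lists, kernel[i][j] = relatedness of term i to term j.
    Returns the "smoothed" document vector doc_vec @ kernel.
    """
    n = len(doc_vec)
    return [sum(doc_vec[i] * kernel[i][j] for i in range(n)) for j in range(n)]

# Kernel matrix from the message; vocabulary order: [puma, cougar, feline]
K = [
    [1.0, 1.0, 0.4],  # puma
    [1.0, 1.0, 0.4],  # cougar
    [0.4, 0.4, 1.0],  # feline
]

d1 = [2, 0, 0]  # doc mentioning only "puma"
d2 = [0, 1, 0]  # doc mentioning only "cougar"

d1_k = apply_kernel(d1, K)  # [2.0, 2.0, 0.8]
d2_k = apply_kernel(d2, K)  # [1.0, 1.0, 0.4]
```

The original d1 and d2 share no nonzero dimension, so their dot product is 0; after the kernel product the smoothed vectors overlap on every dimension, which is exactly the relatedness the bag-of-words model misses.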