Christopher,

It's not Lucene or Solr, but have a look at 
http://www.sematext.com/products/key-phrase-extractor/index.html 


There is an unofficial demo for it (it uses Reuters news feeds, with two 
1-week-long windows for SIPs):

  http://www.sematext.com/demo/kpe/i.html

(it looks like the CollateFilter option on the left is kaput, so ignore it -- 
though that filter is actually quite useful and without it you may see some 
phrase overlap)

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Subscriptions <sub.scripti...@metaheuristica.com>
> To: solr-user@lucene.apache.org
> Sent: Sun, December 27, 2009 9:43:56 PM
> Subject: Using IDF to find Collactions and SIPs . . ?
> 
> I am trying to write a query analyzer to pull:
> 
> 1.    Common phrases (also known as Collocations) within a query
> 
> 2.    Highly unusual phrases (also known as Statistically Improbable
> Phrases or SIPs) within a query
> 
> 
> 
> The Collocations would be similar to facets, except that I am also trying to
> get multi-word phrases as well as single terms. So I suppose I could write
> something that does a chained query off the facet query, looking for words in
> proximity. Conceptually (as I understand it) this should just be a question
> of using the IDF (inverse document frequency, i.e. a measure of how rare
> the term is across the index).
> 
> 
> 
> *         Has anyone tried to write an analyzer that looks for the words
> that typically occur within a given proximity of another word?
> 
> 
> 
> The highly unusual phrases, on the other hand, require getting a handle on
> the IDF, which at present only appears to be available via the explain
> function of debugging.
> 
> 
> 
> *         Has anyone written something to go directly after the IDF score
> only?
> 
> 
> 
> *         If I do have to go down the path of writing this from scratch, is
> the org.apache.lucene.search.Similarity class the one to leverage?
> 
> 
> 
> Most grateful for any feedback or insights,
> 
> 
> 
> Christopher 
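
To make the IDF idea in the question concrete, here is a minimal sketch in plain Python. It is not Solr or Lucene API code, and it is not how the key phrase extractor linked above works. It treats adjacent word pairs as candidate phrases and scores each one with 1 + ln(N / (df + 1)), which, as far as I know, is the same shape as the idf used by Lucene's default similarity. The function names (bigrams, idf_table, rank_phrases) and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    """Lowercase, split on whitespace, and return adjacent word pairs."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def idf_table(docs):
    """Map each bigram to 1 + ln(N / (df + 1)), df being its document frequency."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each bigram once per document, not once per occurrence.
        df.update(set(bigrams(doc)))
    return {phrase: 1.0 + math.log(n / (count + 1)) for phrase, count in df.items()}


def rank_phrases(subset, idf):
    """Return the bigrams found in `subset`, sorted from lowest to highest IDF.

    Bigrams missing from the IDF table are skipped, so `subset` is assumed
    to be drawn from the same corpus the table was built on.
    """
    seen = set()
    for doc in subset:
        seen.update(bigrams(doc))
    known = [p for p in seen if p in idf]
    # Ties on IDF are broken alphabetically so the order is deterministic.
    return sorted(known, key=lambda p: (idf[p], p))


# Hypothetical corpus; the last document stands in for the query's results.
docs = [
    "new york stock exchange opens higher",
    "new york mayor visits stock exchange",
    "stock exchange in new york closes lower",
    "rare purple elephant seen in new york",
]
idf = idf_table(docs)
ranked = rank_phrases(docs[3:], idf)

collocation_candidates = ranked[:1]   # lowest IDF: common across the corpus
sip_candidates = ranked[-4:]          # highest IDF: rare across the corpus
```

Low-IDF phrases are the collocation candidates and high-IDF phrases are the SIP candidates. A real implementation would read document frequencies from the index rather than recounting them, and would probably want stop-word handling and longer n-grams as well.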
