Christopher,

It's not Lucene or Solr, but have a look at http://www.sematext.com/products/key-phrase-extractor/index.html
There is an unofficial demo for it (it uses Reuters news feeds with two 1-week-long windows for SIPs): http://www.sematext.com/demo/kpe/i.html

It looks like the CollateFilter option on the left is kaput, so ignore it. That filter is actually quite useful, though, and without it you may see some phrase overlap.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message ----
> From: Subscriptions <sub.scripti...@metaheuristica.com>
> To: solr-user@lucene.apache.org
> Sent: Sun, December 27, 2009 9:43:56 PM
> Subject: Using IDF to find Collactions and SIPs . . ?
>
> I am trying to write a query analyzer to pull:
>
> 1. Common phrases (also known as Collocations) within a query
>
> 2. Highly unusual phrases (also known as Statistically Improbable
>    Phrases or SIPs) within a query
>
> The Collocations would be similar to facets, except that I am also trying
> to get multi-word phrases as well as single terms. So I suppose I could
> write something that does a chained query off the facet query, looking for
> words in proximity. Conceptually (as I understand it) this should just be a
> question of using the IDF (inverse document frequency, i.e. a measure of
> how rarely the term appears across the index).
>
> * Has anyone tried to write an analyzer that looks for the words
>   that typically occur within a given proximity of another word?
>
> The highly unusual phrases, on the other hand, require getting a handle on
> the IDF, which at present only appears to be available via the explain
> function of debugging.
>
> * Has anyone written something to go directly after the IDF score
>   only?
>
> * If I do have to go down the path of writing this from scratch, is
>   the org.apache.lucene.search.Similarity class the one to leverage?
>
> Most grateful for any feedback or insights,
>
> Christopher
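
For what it's worth, the IDF idea in the question can be sketched in a few lines of plain Python. This is not Lucene or Solr code, and it is not how the key phrase extractor above works; the function names and the toy corpus are made up for illustration. The formula, 1 + ln(numDocs / (docFreq + 1)), is the one I recall Lucene's DefaultSimilarity using, so treat that as an assumption too. Phrases with low IDF across the corpus are collocation candidates, and phrases with high IDF are SIP candidates.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_idf(docs, n=2):
    """Map each n-gram to an IDF value: 1 + ln(num_docs / (doc_freq + 1)).

    Tokenization is a naive lowercase whitespace split (an assumption made
    for brevity; a real analyzer chain would do much more).
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each phrase once per document, not once per occurrence.
        doc_freq.update(set(ngrams(doc.lower().split(), n)))
    num_docs = len(docs)
    return {p: 1.0 + math.log(num_docs / (df + 1)) for p, df in doc_freq.items()}


def rank_query_phrases(query, idf, num_docs, n=2):
    """Sort the query's n-grams from most common (low IDF) to rarest (high IDF)."""
    # Phrases never seen in the corpus get the IDF of doc_freq = 0.
    unseen = 1.0 + math.log(num_docs)
    phrases = set(ngrams(query.lower().split(), n))
    return sorted(((p, idf.get(p, unseen)) for p in phrases), key=lambda x: x[1])


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    corpus = ["new york city", "new york times", "big apple", "new york"]
    idf = phrase_idf(corpus)
    for phrase, score in rank_query_phrases("new york city", idf, len(corpus)):
        print(f"{score:.3f}  {phrase}")
```

The head of the ranked list holds the collocation candidates and the tail holds the SIP candidates. A real SIP measure would normally compare a phrase's frequency in one document or time window against a background corpus rather than rely on raw IDF alone, so take this as a starting point only.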