Using IDF to find Collocations and SIPs?

Subscriptions Sun, 27 Dec 2009 18:45:45 -0800

I am trying to write a query analyzer to pull:


1.      Common phrases (also known as Collocations) within a query

 

2.      Highly unusual phrases (also known as Statistically Improbable
Phrases or SIPs) within a query

 

The Collocations would be similar to facets, except that I am also trying to
get multi-word phrases as well as single terms. I suppose I could write
something that runs a chained query off the facet query, looking for words in
proximity. Conceptually (as I understand it) this should just be a question
of using the IDF (inverse document frequency, i.e. a measure of how rare
the term is across the index).
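
A minimal sketch of the proximity idea, written in plain Python over a toy in-memory corpus rather than against the Lucene or Solr APIs. The function name `cooccurrence_counts`, the sample documents and the `window` parameter are all illustrative assumptions; a real implementation would presumably read term positions from the index instead of re-tokenizing text.

```python
from collections import Counter

def cooccurrence_counts(docs, window=2):
    """Count unordered word pairs that occur within `window` positions of each other."""
    pairs = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        for i, left in enumerate(tokens):
            # Look only ahead, so each pair is counted once per occurrence.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    pairs[tuple(sorted((left, right)))] += 1
    return pairs

# Hypothetical toy corpus standing in for the index.
docs = [
    "new york city is large",
    "i love new york",
    "york is in england",
]

counts = cooccurrence_counts(docs, window=1)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Pairs with high counts are collocation candidates. Raw counts favour frequent words, so in practice the counts would likely need normalizing against the individual term frequencies before ranking.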

 

*         Has anyone tried to write an analyzer that looks for the words
that typically occur within a given proximity of another word?

 

The highly unusual phrases, on the other hand, require getting a handle on
the IDF, which at present appears to be available only via the explain
output used for debugging.
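
For reference, a rough sketch of scoring phrases by IDF outside of the explain output. It uses the formula 1 + ln(numDocs / (docFreq + 1)), which is my understanding of what Lucene's DefaultSimilarity computes; treat that as an assumption to verify against the Similarity source. The helper names, the toy documents and the choice of phrase-level document frequency as the rarity signal are all hypothetical.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed to mirror Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def phrase_doc_freq(phrase, docs):
    """Number of documents containing the phrase as a contiguous run of tokens."""
    words = phrase.lower().split()
    n = len(words)
    count = 0
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        if any(tokens[i : i + n] == words for i in range(len(tokens) - n + 1)):
            count += 1
    return count

def rank_phrases(phrases, docs):
    """Return (phrase, idf) pairs, rarest phrase first."""
    scored = [(p, idf(phrase_doc_freq(p, docs), len(docs))) for p in phrases]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy corpus standing in for the index.
docs = [
    "new york city is large",
    "i love new york",
    "york is in england",
]

for phrase, score in rank_phrases(["new york", "york is"], docs):
    print(phrase, round(score, 3))
```

Phrases with the highest scores appear in the fewest documents, which is one simple way to approximate "statistically improbable"; a fuller treatment would compare each phrase's frequency with what its individual terms predict.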

 

*         Has anyone written something to go directly after the IDF score
only?

 

*         If I do have to go down the path of writing this from scratch, is
the org.apache.lucene.search.Similarity class the one to leverage?

 

Most grateful for any feedback or insights,

 

Christopher
