I am trying to write a query analyzer to pull:
1. Common phrases (also known as collocations) within a query
2. Highly unusual phrases (also known as Statistically Improbable Phrases or SIPs) within a query

The collocations would be similar to facets, except that I am also trying to get multi-word phrases as well as single terms. So I suppose I could write something that does a chained query off the facet query, looking for words in proximity. Conceptually (as I understand it) this should just be a question of using the IDF (inverse document frequency, i.e. a measure of how rarely the term appears across the index).

* Has anyone tried to write an analyzer that looks for the words that typically occur within a given proximity of another word?

The highly unusual phrases, on the other hand, require getting a handle on the IDF, which at present only appears to be available via the explain function of debugging.

* Has anyone written something to go directly after the IDF score only?
* If I do have to go down the path of writing this from scratch, is the org.apache.lucene.search.Similarity class the one to leverage?

Most grateful for any feedback or insights,
Christopher
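
P.S. To make the IDF part concrete, here is a minimal sketch of the kind of calculation I mean by "going directly after the IDF score". It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)). The class name IdfSketch, the method name idf, and all of the counts are hypothetical and only for illustration; in practice the document frequencies would have to come from the index itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IdfSketch {

    // Classic Lucene-style IDF (assumed formula): 1 + ln(numDocs / (docFreq + 1)).
    // A higher value means the term occurs in fewer documents.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1000; // made-up index size

        // Made-up document frequencies: term -> number of documents containing it.
        Map<String, Long> docFreqs = new LinkedHashMap<>();
        docFreqs.put("the", 990L);
        docFreqs.put("lucene", 40L);
        docFreqs.put("collocation", 2L);

        // Print the IDF of each term; rarer terms should score higher.
        for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
            System.out.printf("%s idf=%.3f%n", e.getKey(), idf(e.getValue(), numDocs));
        }
    }
}
```

With something like this, terms (or phrases treated as single terms) could be ranked by IDF alone, without running a full scored query and parsing the explain output.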