Hi Ryan,

We do basically the same thing, using a modified ShingleFilter (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html). I have it set up to build 'shingles' of sizes 2, 3, 4, and 5, and I index each size into its own field. If there is a better way of doing this sort of thing, I'd love to know :-)
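
For anyone unfamiliar with the term, here is a rough sketch of what building shingles of several sizes looks like. It is plain Python, not the Lucene ShingleFilter API, and it assumes simple lowercasing and whitespace tokenization; the names shingles, shingle_fields, and the "shingle_N" field names are made up for illustration.

```python
def shingles(tokens, size):
    """Return every run of `size` adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def shingle_fields(text, sizes=(2, 3, 4, 5)):
    """Map a hypothetical field name per shingle size to that size's shingles."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return {f"shingle_{n}": shingles(tokens, n) for n in sizes}


fields = shingle_fields("the quick brown fox jumps")
for name, values in fields.items():
    print(name, values)
```

Text shorter than a given shingle size simply yields an empty list for that field.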

Brendan

On Aug 13, 2008, at 3:59 PM, Ryan McKinley wrote:

I'm looking for a way to get common word groups within documents. That is, what are the top two-word, three-word, ... n-word groups within the index?

I was messing with indexing adjacent words together (sorry about the earlier commit)... is this a reasonable approach? Any other ideas for pulling out common phrases? Any simple post-processing?
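
As a point of comparison, one simple post-processing pass outside the index is to count adjacent-word groups directly. The sketch below is plain Python, not anything Lucene or Solr provides; it assumes the documents fit in memory and are tokenized on whitespace, and top_phrases is a hypothetical helper name.

```python
from collections import Counter


def top_phrases(docs, size, k=10):
    """Return the k most frequent groups of `size` adjacent words across docs."""
    counts = Counter()
    for text in docs:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        # zip over `size` staggered copies of the token list yields each n-gram
        counts.update(zip(*(tokens[i:] for i in range(size))))
    return [(" ".join(gram), n) for gram, n in counts.most_common(k)]


docs = [
    "new york city is big",
    "i love new york city",
    "new york is far",
]
print(top_phrases(docs, 2, k=3))
print(top_phrases(docs, 3, k=1))
```

This counts total occurrences rather than the number of documents containing a phrase; counting per document would need each document's n-grams deduplicated first.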

ryan
