Hi Ryan,
We do basically the same thing, using a modified ShingleFilter
(http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html).
I have it set up to build 'shingles' of size 2, 3, 4, and 5, which I
index into separate fields. If there is a better way of doing this
sort of thing I'd love to know :-)
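
For anyone following along, here is a rough sketch of the shingle idea in plain Python. It is not the ShingleFilter code and does not use Lucene at all: the whitespace tokenizer, the function names `shingles` and `top_shingles`, and the sample documents are all made up for illustration. It keeps one counter per shingle size, loosely mirroring one field per size, and reports the most frequent word groups.

```python
from collections import Counter


def shingles(tokens, size):
    """Return every run of `size` adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def top_shingles(docs, sizes=(2, 3, 4, 5), n=3):
    """Count shingles of each size across all docs.

    One Counter per size, loosely analogous to one index field per size.
    """
    counts = {size: Counter() for size in sizes}
    for text in docs:
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        tokens = text.lower().split()
        for size in sizes:
            counts[size].update(shingles(tokens, size))
    # Keep only the n most frequent shingles for each size.
    return {size: counts[size].most_common(n) for size in sizes}


# Hypothetical sample documents, for illustration only.
docs = [
    "the quick brown fox jumps",
    "the quick brown dog sleeps",
    "a quick brown fox runs",
]
top = top_shingles(docs, n=1)
print(top[2])
```

In a real index the counts would presumably come from term frequencies in each shingle field rather than from an in-memory counter, so treat this only as a picture of what the fields contain.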
Brendan
On Aug 13, 2008, at 3:59 PM, Ryan McKinley wrote:
I'm looking for a way to get common word groups within documents.
That is, what are the top two, three, ... n word groups within the
index.
I was messing with indexing adjacent words together (sorry about the
earlier commit)... is this a reasonable approach? Any other ideas
for pulling out common phrases? Any simple post processing?
ryan