Hi Brendan,

What modifications have you made to ShingleFilter? Can you share them?

Karl Wettin recently contributed ShingleMatrixFilter to Lucene - among other things, it can generate shingles of more than one size (check the test cases for how to do this):

<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleMatrixFilter.html>

Steve

On 08/13/2008 at 5:27 PM, Brendan Grainger wrote:
> Hi Ryan,
>
> We do basically the same thing, using a modified ShingleFilter
> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html ).
> I have it set up to build 'shingles' of size 2, 3, 4, 5
> which I index into separate fields. If there is a better way of doing
> this sort of thing I'd love to know :-)
>
> Brendan
>
> On Aug 13, 2008, at 3:59 PM, Ryan McKinley wrote:
>
> > I'm looking for a way to get common word groups within documents.
> > That is, what are the top two, three, ... n word groups within the
> > index.
> >
> > I was messing with indexing adjacent words together (sorry about the
> > earlier commit)... is this a reasonable approach? Any other ideas for
> > pulling out common phrases? Any simple post processing?
> >
> > ryan
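
P.S. To make the idea in the quoted thread concrete, here is a rough sketch in plain Python of building shingles of size 2 through 5 and counting the most common ones per size. This is not Lucene code and is not how ShingleFilter or ShingleMatrixFilter are implemented; the function names (`shingles`, `top_shingles`), the whitespace tokenization, and the lowercasing are all assumptions made for illustration only.

```python
from collections import Counter


def shingles(tokens, min_size=2, max_size=5):
    """Yield (size, shingle) pairs for every run of adjacent tokens.

    Hypothetical helper: a shingle here is just `size` adjacent
    tokens joined by a single space.
    """
    for size in range(min_size, max_size + 1):
        # range() is empty when the document is shorter than `size`
        for start in range(len(tokens) - size + 1):
            yield size, " ".join(tokens[start:start + size])


def top_shingles(docs, min_size=2, max_size=5, n=3):
    """Return the n most frequent shingles for each shingle size.

    One Counter per size loosely mirrors indexing each size
    into its own field, as described in the quoted message.
    """
    counts = {size: Counter() for size in range(min_size, max_size + 1)}
    for text in docs:
        # Assumption: naive lowercase + whitespace tokenization
        tokens = text.lower().split()
        for size, shingle in shingles(tokens, min_size, max_size):
            counts[size][shingle] += 1
    return {size: counter.most_common(n) for size, counter in counts.items()}
```

Calling `top_shingles(list_of_document_strings)` returns a dict keyed by shingle size, each value a list of `(shingle, count)` pairs. Counting in memory like this only works for small collections; for a real index the counts would have to come from term statistics on the shingle fields.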