This is fairly high on our to-do list. I'm inclined to index the bi-words at the same position as the first word, like synonyms.
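
To make the idea concrete, here is a rough sketch in plain Python (not Lucene code). It emits each word at its own position and then the bi-word starting there at that same position, the way a synonym would be stacked. The function names `biword_tokens` and `top_biwords`, the whitespace tokenization, and the space-joined bi-word term are all assumptions for illustration only.

```python
from collections import Counter


def biword_tokens(words):
    """Return (term, position) pairs for a list of words.

    Each word is emitted at its own position; the bi-word that starts
    with that word is emitted at the same position, like a synonym.
    """
    tokens = []
    for pos, word in enumerate(words):
        tokens.append((word, pos))
        if pos + 1 < len(words):
            # Hypothetical term format: the two words joined by a space.
            tokens.append((word + " " + words[pos + 1], pos))
    return tokens


def top_biwords(docs, n):
    """Count bi-words across documents and return the n most common."""
    counts = Counter()
    for doc in docs:
        # Naive tokenization (lowercase, split on whitespace); an assumption.
        for term, _ in biword_tokens(doc.lower().split()):
            if " " in term:
                counts[term] += 1
    return counts.most_common(n)
```

Counting the stacked terms this way is one simple form of post-processing for finding common word pairs; a real index would get the same counts from term frequencies.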

wunder

On 8/13/08 2:27 PM, "Brendan Grainger" <[EMAIL PROTECTED]> wrote:

> Hi Ryan,
>
> We do basically the same thing, using a modified ShingleFilter
> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html).
> I have it set up to build 'shingles' of size 2, 3, 4, 5 which I
> index into separate fields. If there is a better way of doing this
> sort of thing I'd love to know :-)
>
> Brendan
>
> On Aug 13, 2008, at 3:59 PM, Ryan McKinley wrote:
>
>> I'm looking for a way to get common word groups within documents.
>> That is, what are the top two, three, ... n word groups within the
>> index.
>>
>> I was messing with indexing adjacent words together (sorry about the
>> earlier commit)... is this a reasonable approach? Any other ideas
>> for pulling out common phrases? Any simple post processing?
>>
>> ryan