This is fairly high on our to-do list. I'm inclined to index the
bi-words at the same position as the first word, like synonyms.
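
A minimal sketch of that idea in plain Python, not Lucene code: each bi-word is emitted at the same position as its first word, the way a synonym would be stacked on a token. The function name and the (position, term) tuple layout are assumptions made for illustration only.

```python
def bigrams_at_first_word_position(tokens):
    """Return (position, term) pairs with each bi-word stacked on its first word.

    Hypothetical helper for illustration; a real analyzer would do this by
    emitting the bi-word token with a position increment of zero.
    """
    postings = []
    for pos, word in enumerate(tokens):
        # The single word occupies its own position.
        postings.append((pos, word))
        # The bi-word starting here shares that position, like a synonym.
        if pos + 1 < len(tokens):
            postings.append((pos, word + " " + tokens[pos + 1]))
    return postings


postings = bigrams_at_first_word_position("please divide this sentence".split())
for position, term in postings:
    print(position, term)
```

Keeping the bi-word at the first word's position means phrase and proximity matching still see the original word positions.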

wunder

On 8/13/08 2:27 PM, "Brendan Grainger" <[EMAIL PROTECTED]> wrote:

> Hi Ryan,
> 
> We do basically the same thing, using a modified ShingleFilter
> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html).
> I have it set up to build 'shingles' of size 2, 3, 4, 5 which I
> index into separate fields. If there is a better way of doing this
> sort of thing I'd love to know :-)
> 
> Brendan
> 
> On Aug 13, 2008, at 3:59 PM, Ryan McKinley wrote:
> 
>> I'm looking for a way to get common word groups within documents.
>> That is, what are the top two, three, ... n word groups within the
>> index.
>> 
>> I was messing with indexing adjacent words together (sorry about the
>> earlier commit)... is this a reasonable approach?  Any other ideas
>> for pulling out common phrases?  Any simple post processing?
>> 
>> ryan
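
For the simple post-processing route raised in the original question, a rough sketch in plain Python (not Lucene or Solr code) is to build word shingles of a given size and count them across documents. Lowercasing, whitespace tokenization, the function names, and the sample documents are all assumptions for illustration; the counts are total occurrences, not document frequencies.

```python
from collections import Counter


def shingles(tokens, size):
    """Return every run of `size` adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def top_word_groups(docs, size, n):
    """Count shingles of one size across all documents; return the n most common."""
    counts = Counter()
    for text in docs:
        # Assumed tokenization: lowercase, split on whitespace.
        counts.update(shingles(text.lower().split(), size))
    return counts.most_common(n)


# Hypothetical sample documents.
docs = ["new york city hall", "the new york times", "new york city"]
top_pairs = top_word_groups(docs, size=2, n=2)
top_triples = top_word_groups(docs, size=3, n=1)
print(top_pairs)
print(top_triples)
```

Running one pass per shingle size mirrors the separate-fields setup described above, at the cost of holding every distinct shingle in memory.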

