I'm currently looking at methods for term extraction and automatic keyword generation from indexed documents. I've been experimenting with MoreLikeThis and the values returned by the "mlt.interestingTerms" parameter, and so far this approach has worked well. However, I'd like to analyze documents more intelligently so that phrase keywords such as "open source", "Microsoft Office" and "Bill Gates" are recognized as single terms rather than being split into separate single-word tokens (the field is never used in search queries, so matching is not an issue). I've been looking at SynonymFilterFactory as a possible solution, but I haven't been able to work out the specifics of how to configure it for phrase mappings.
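To make the desired behaviour concrete, here is a rough sketch in plain Python of what I'd like the analysis step to do. This is not Solr code: the function name `merge_phrases` and the phrase list are made up for illustration, and it assumes the known phrases are available up front as a simple list.

```python
from typing import Iterable, List, Set, Tuple


def merge_phrases(tokens: List[str], phrases: Iterable[str]) -> List[str]:
    """Greedily merge known multi-word phrases into single tokens.

    Matching is case-insensitive and prefers the longest phrase that
    starts at the current position. Hypothetical helper, for illustration only.
    """
    # Store each phrase as a tuple of lower-cased words for fast lookup.
    phrase_set: Set[Tuple[str, ...]] = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)

    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest possible window first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in phrase_set:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single word as its own token.
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    known = ["open source", "Microsoft Office", "Bill Gates"]
    print(merge_phrases("Bill Gates praised open source software".split(), known))
```

The idea is that the merged tokens, rather than the individual words, would then be what ends up in the field used for interesting-term extraction.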
Has anybody else dealt with this problem before, or can anyone offer insights into how to achieve the desired results?

Thanks in advance,
Pieter