Thanks for the ideas - some follow-up questions inline below:
> * use shingles e.g. to turn two-word phrases into single terms (how
> long is your average phrase?).

Would this be different from what I was calling "common grams" (other
than shingling every two words, rather than just the common ones)?

> * in addition to the above, maybe for phrases with > 2 terms, consider
> just a boolean conjunction of the shingled phrases instead of a "real"
> phrase query: e.g. "more like this" -> (more_like AND like_this). This
> would have some false positives.

This would definitely help, but, IIRC, we moved to phrase queries because
conjunctive queries produced too many false positives. It would be an
interesting experiment to see how many false positives remain when
shingling first and then just doing conjunctive queries (rough sketches
of both pieces are in the P.S. below).

> * use a more aggressive stopwords list for your "MorePhrasesLikeThis".
> * reduce this number (200), and instead work harder to prune out which
> phrases are the "most descriptive" from the seed document, e.g. based
> on some heuristics like their frequency or location within that seed
> document, so your query isn't so massive.

This is something I've been asking for (perform some sort of PCA /
feature selection on the actual terms used), but it's of questionable
value and hard to do "right", so it hasn't happened yet: it's not clear
that there will be terms that are very common but not also very
descriptive, so the extent to which this would help is unknown. (The
last sketch in the P.S. shows the sort of heuristic pruning I imagine.)

Thanks again for the ideas!

Aaron
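
P.S. To make the shingle idea concrete, here's a rough sketch of a
bigram-shingling Analyzer. This is just my reading of the suggestion,
not tested code; it assumes a reasonably recent Lucene (the Analyzer
plumbing and package names have moved around between versions), and the
"_" separator is only there to match the more_like / like_this notation
above:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    Analyzer shingleAnalyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream stream = new LowerCaseFilter(source);
            // Emit only two-word shingles joined with "_", so
            // "more like this" produces the terms more_like and like_this.
            ShingleFilter shingles = new ShingleFilter(stream, 2, 2);
            shingles.setTokenSeparator("_");
            shingles.setOutputUnigrams(false);
            return new TokenStreamComponents(source, shingles);
        }
    };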
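
And the conjunctive-query variant for phrases of > 2 terms, assuming
the terms were indexed with the analyzer above and a placeholder field
name "body" (BooleanQuery.Builder is the newer API; older Lucene builds
the BooleanQuery directly):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // "more like this" -> +body:more_like +body:like_this
    // This matches any doc containing both bigrams anywhere, which is
    // where the false positives come from: the bigrams need not be
    // adjacent, as they would be in a real phrase query.
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    builder.add(new TermQuery(new Term("body", "more_like")),
                BooleanClause.Occur.MUST);
    builder.add(new TermQuery(new Term("body", "like_this")),
                BooleanClause.Occur.MUST);
    Query conjunction = builder.build();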
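
Finally, the kind of phrase pruning I was hand-waving at. This is purely
hypothetical (the Candidate class and the scoring weights are made up):
score each candidate phrase from the seed document by its frequency,
discounted by how late it first appears, and keep only the top n instead
of all 200.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical holder for a candidate phrase from the seed document.
    class Candidate {
        final String phrase;
        final int freqInSeed;    // occurrences within the seed document
        final int firstPosition; // token offset of the first occurrence
        Candidate(String phrase, int freqInSeed, int firstPosition) {
            this.phrase = phrase;
            this.freqInSeed = freqInSeed;
            this.firstPosition = firstPosition;
        }
    }

    // Keep the n highest-scoring phrases; the score (frequency damped
    // by position) is a made-up heuristic, not a tuned formula.
    static List<Candidate> prune(List<Candidate> all, int n) {
        List<Candidate> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingDouble(
                c -> -(c.freqInSeed / (1.0 + 0.01 * c.firstPosition))));
        return sorted.subList(0, Math.min(n, sorted.size()));
    }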