Thanks for the ideas - some followup questions in-line below:

> * use shingles e.g. to turn two-word phrases into single terms (how
> long is your average phrase?).

Would this be different from what I was calling "common grams" (other
than shingling every pair of adjacent words, rather than just pairs
involving common ones)?
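
For reference, here's my current understanding of the difference, as a
minimal sketch (assuming the ShingleFilter and CommonGramsFilter from
Lucene's analyzers-common module; exact package names vary by Lucene
version, and the common-words set here is made up):

    import java.io.StringReader;
    import java.util.Arrays;

    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ShingleVsCommonGrams {

        static void dump(TokenStream ts) throws Exception {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.end();
            ts.close();
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String text = "more like this please";

            // ShingleFilter pairs *every* adjacent word:
            //   more_like like_this this_please
            WhitespaceTokenizer t1 = new WhitespaceTokenizer();
            t1.setReader(new StringReader(text));
            ShingleFilter shingles = new ShingleFilter(t1, 2, 2);
            shingles.setTokenSeparator("_");
            shingles.setOutputUnigrams(false);
            dump(shingles);

            // CommonGramsFilter only forms a bigram when one side is a
            // common word, and it keeps the unigrams too:
            //   more more_like like like_this this please
            CharArraySet common = new CharArraySet(Arrays.asList("like"), true);
            WhitespaceTokenizer t2 = new WhitespaceTokenizer();
            t2.setReader(new StringReader(text));
            dump(new CommonGramsFilter(t2, common));
        }
    }

So as far as I can tell it's essentially the same trick; shingling just
applies it to every adjacent pair instead of only the pairs that touch
a common word.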


> * in addition to the above, maybe for phrases with > 2 terms, consider
> just a boolean conjunction of the shingled phrases instead of a "real"
> phrase query: e.g. "more like this" -> (more_like AND like_this). This
> would have some false positives.

This would definitely help, but IIRC we moved to phrase queries because
we had too many false positives. It would be an interesting experiment
to see how many false positives remain when shingling and then just
doing conjunctive queries.
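
To make that experiment concrete, here's the kind of query I'd build, as
a rough sketch (assuming a recent Lucene query API with
BooleanQuery.Builder; the field name "body" is made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // "more like this" -> (more_like AND like_this): every 2-shingle of
    // the phrase must occur somewhere in the document, but not
    // necessarily adjacently; that gap is where the remaining false
    // positives would come from.
    static Query conjunctionOfShingles(String field, String... shingles) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (String s : shingles) {
            b.add(new TermQuery(new Term(field, s)), BooleanClause.Occur.MUST);
        }
        return b.build();
    }

    // e.g.: Query q = conjunctionOfShingles("body", "more_like", "like_this");

The false positives would be documents where all the shingles occur but
not contiguously, e.g. "more like home ... just like this one".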


> * use a more aggressive stopwords list for your "MorePhrasesLikeThis".
> * reduce this number (200), and instead work harder to prune out which
> phrases are the "most descriptive" from the seed document, e.g. based
> on some heuristics like their frequency or location within that seed
> document, so your query isn't so massive.

This is something I've been asking for (performing some sort of PCA /
feature selection on the actual terms used), but it's of questionable
value and hard to do "right," so it hasn't happened yet. It's not clear
that there will be terms that are very common yet not also very
descriptive, so the extent to which this would help is unknown.
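
For what it's worth, a much cheaper version than PCA would be the
frequency heuristic you mention: score each candidate shingle by its
frequency in the seed document times its idf in the index, and keep only
the top K. A sketch (the seedTf map, term to count within the seed
document, is assumed to come from term vectors or from re-analyzing the
document):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    // Keep only the K "most descriptive" shingles from the seed document,
    // scored by tf-in-seed * idf-in-index, instead of sending all ~200.
    static List<String> topShingles(IndexReader reader, String field,
                                    Map<String, Integer> seedTf, int k)
            throws IOException {
        int n = reader.numDocs();
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : seedTf.entrySet()) {
            int df = reader.docFreq(new Term(field, e.getKey()));
            if (df == 0) continue;                     // never indexed; skip
            double idf = Math.log((double) n / df);    // rare in corpus -> descriptive
            score.put(e.getKey(), e.getValue() * idf); // frequent in seed -> descriptive
        }
        List<String> ranked = new ArrayList<>(score.keySet());
        ranked.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
        return ranked.subList(0, Math.min(k, ranked.size()));
    }

At least that would let us measure whether the very common shingles
really are carrying the descriptive weight: anything that appears in
most documents gets an idf near zero no matter how often the seed
document uses it.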

Thanks again for the ideas!
     Aaron
