On Wed, Oct 24, 2012 at 11:09 AM, Aaron Daubman <daub...@gmail.com> wrote:
> Greetings,
>
> We have a solr instance in use that gets some perhaps atypical queries
> and suffers from poor (>2 second) QTimes.
>
> Documents (~2,350,000) in this instance are mainly comprised of
> various "descriptive fields", such as multi-word (phrase) tags - an
> average document contains 200-400 phrases like this across several
> different multi-valued field types.
>
> A custom QueryComponent has been built that functions somewhat like a
> very specific MoreLikeThis. A seed document is specified via the
> incoming query, its terms are retrieved, boosted both by query
> parameters as well as fields within the document that specify term
> weighting, sorted by this custom boosting, and then a second query is
> crafted by taking the top 200 (sorted by the custom boosting)
> resulting field values paired with their fields and searching for
> documents matching these 200 values.

A few more ideas:

* Use shingles, e.g. to turn two-word phrases into single terms (how
  long is your average phrase?).
* In addition to the above, for phrases with more than 2 terms, consider
  just a boolean conjunction of the shingled phrases instead of a "real"
  phrase query: e.g. "more like this" -> (more_like AND like_this).
  This would produce some false positives.
* Use a more aggressive stopwords list for your "MorePhrasesLikeThis".
* Reduce this number 200, and instead work harder to prune out which
  phrases are the "most descriptive" from the seed document, e.g. based
  on some heuristics like their frequency or location within that seed
  document, so your query isn't so massive.
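
To make the shingle-conjunction idea concrete, here is a rough sketch in plain Python (not Solr or Lucene code). It assumes the index has a field holding two-word shingles joined with "_"; the field name tags_shingled and the helper names are made up for illustration, and query-syntax escaping is left out. In a real setup the shingles would more likely come from the analysis chain (for example a shingle filter) than from string handling like this.

```python
def shingles(phrase, size=2, sep="_"):
    """Return the overlapping word n-grams ("shingles") of a phrase as single terms."""
    words = phrase.lower().split()
    if not words:
        return []
    if len(words) <= size:
        # A phrase no longer than the shingle size collapses into one term.
        return [sep.join(words)]
    return [sep.join(words[i:i + size]) for i in range(len(words) - size + 1)]


def conjunction_query(field, phrase):
    """Build a boolean AND of shingle terms in place of a positional phrase query.

    `field` is a hypothetical shingled field name; no escaping is applied.
    """
    clauses = ["{}:{}".format(field, term) for term in shingles(phrase)]
    if not clauses:
        return ""
    if len(clauses) == 1:
        return clauses[0]
    return "(" + " AND ".join(clauses) + ")"


print(conjunction_query("tags_shingled", "more like this"))
# (tags_shingled:more_like AND tags_shingled:like_this)
```

The conjunction only checks that both shingles occur somewhere in the field, not that they are adjacent, which is where the false positives come from.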