Greetings,

We have a Solr instance in use that gets some perhaps atypical queries and suffers from poor (>2 second) QTimes.
Documents (~2,350,000 of them) mainly comprise various "descriptive fields", such as multi-word (phrase) tags; an average document contains 200-400 such phrases spread across several different multi-valued field types.

A custom QueryComponent has been built that functions somewhat like a very specific MoreLikeThis. A seed document is specified in the incoming query and its terms are retrieved, boosted both by query parameters and by fields within the document that specify term weighting, and sorted by this custom boost. A second query is then crafted by taking the top 200 resulting field values (sorted by the custom boost), paired with their fields, and searching for documents matching those 200 values. For many searches, 25-50% of the documents in the index match this query of 200 terms (i.e. 600,000 to 1,200,000 documents).

After some profiling, it seems that the majority of the QTime comes from dealing with phrases and the resulting term positions, since most of the search terms are actually multi-word tokenized phrases (processing is dominated by ExactPhraseScorer on down, particularly SegmentTermPositions and readVInt).

I have thought of a few ways to improve performance for this use case, and am looking for feedback on which seems best, as well as any insight into other approaches I haven't considered (or things to look into to understand the slow QTimes more fully):

1) Shard the index. Since there is no key that would route queries to a particular shard, this would only be of benefit if scoring is done in parallel. Is there documentation I have so far missed that describes distributed searching for this case? (I haven't found anything that really describes the differences in scoring for distributed vs. non-distributed indices, aside from the warnings that IDF doesn't work - which I don't think we really care about.)
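For concreteness, the term-selection step of the custom component described above amounts to roughly the following (a simplified sketch in plain Java, not our actual code; the field names, phrases, and weights are made up):

```java
import java.util.*;
import java.util.stream.*;

// Rough sketch of the term-selection step: given the seed document's
// (field:phrase) pairs with their combined custom boost (query boost
// multiplied by the per-document weighting field), keep only the top N
// by boost for the second query. N is currently fixed at 200.
public class TermSelector {
    static final int MAX_TERMS = 200;

    // Keys are "field:phrase"; values are the computed custom boost.
    static List<String> selectTopTerms(Map<String, Double> weightedTerms, int limit) {
        return weightedTerms.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> terms = new HashMap<>();
        terms.put("tags:brazilian pop", 3.0);   // hypothetical boosts
        terms.put("tags:bossa nova", 5.0);
        terms.put("mood:mellow", 1.5);
        // prints the two highest-boost terms:
        // [tags:bossa nova, tags:brazilian pop]
        System.out.println(selectTopTerms(terms, 2));
    }
}
```

Each of the 200 selected values then becomes a (phrase) clause of the second query, which is why term positions get hit so heavily.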
2) Implement "Common Grams" as described here: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 It's not clear how many individual words in the phrases being used are, in fact, common, but given that 25-50% of the documents in the index match many queries, this seems like it may be of value.

3) Try to make mm (minimum terms should match) work for the custom query. I haven't been able to figure out exactly how this parameter works, but my thinking is along the lines of "if only 2 of those 200 terms match a document, it doesn't need to be scored". What I don't currently understand is at what point failing the mm requirement short-circuits - e.g. does the doc still get scored? If it does short-circuit prior to scoring, this may help somewhat, although it's not clear it would prevent the many, many reads of term positions that are still killing QTime.

4) Use a dynamic number of terms (rather than the currently fixed 200) based on the custom boosting/weighting value - e.g. only use terms whose calculated value is above some threshold. I'm not keen on this, since some documents may be dominated by many weak terms without having any great ones, so it might break for those (finding the "sweet spot" cutoff would not be straightforward).

5) *This is my current favorite*: stop tokenizing/analyzing these terms and just use KeywordTokenizer. Most of these phrases are pre-vetted, and it may be possible to clean/process any others before creating the docs. My main worry here is that, currently, if I understand correctly, a document with the phrase "brazilian pop" would still be returned as a match for a seed document containing only the phrase "brazilian" (not the other way around, but that is not necessary); with KeywordTokenizer, this would no longer be the case. If I switched from the current dubious tokenize/stem/etc. chain
and just used KeywordTokenizer, would this allow a query like "this used to be a long phrase query" to match documents that have "this used to be a long phrase query" as one of the multi-valued values in the field, without having to pull term positions (and thus significantly speed up performance)?

Thanks,
Aaron
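P.S. To make option 4 concrete, this is the kind of dynamic cutoff I was imagining (a rough sketch in plain Java, not anything Solr provides; the threshold and the fallback minimum are invented parameters). The fallback is there precisely because of the worry above: a document dominated by many weak terms would otherwise end up with an empty or tiny second query.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of option 4: keep terms whose custom boost clears
// a threshold, but fall back to a minimum count so that seed documents
// with only weak terms still produce a usable second query.
public class DynamicCutoff {
    static List<String> selectTerms(Map<String, Double> weighted,
                                    double threshold, int minTerms, int maxTerms) {
        // Sort all candidate terms by boost, highest first.
        List<String> sorted = weighted.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        // Count how many clear the threshold, then clamp between
        // minTerms (the fallback) and maxTerms (the current fixed 200).
        long above = weighted.values().stream().filter(w -> w >= threshold).count();
        int keep = (int) Math.min(maxTerms, Math.max(minTerms, above));
        return sorted.subList(0, Math.min(keep, sorted.size()));
    }

    public static void main(String[] args) {
        Map<String, Double> weighted = Map.of("a", 0.9, "b", 0.8, "c", 0.2);
        // Only "a" and "b" clear the 0.5 threshold: prints [a, b]
        System.out.println(selectTerms(weighted, 0.5, 1, 200));
    }
}
```

The hard part, as noted, is that no single threshold value is likely to suit both boost-rich and boost-poor seed documents.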