Greetings,

We have a Solr instance in use that gets some perhaps atypical queries
and suffers from poor (>2 second) QTimes.

Documents (~2,350,000) in this instance mainly consist of various
"descriptive fields", such as multi-word (phrase) tags - an average
document contains 200-400 such phrases spread across several
different multi-valued fields.

A custom QueryComponent has been built that functions somewhat like a
very specific MoreLikeThis. A seed document is specified via the
incoming query; its terms are retrieved and boosted, both by query
parameters and by fields within the document that specify term
weighting, then sorted by this custom boost. A second query is then
crafted from the top 200 resulting field/value pairs (sorted by the
custom boost), searching for documents matching any of these 200
values.

For many searches, 25-50% of the documents match the query of 200
terms (so 600,000 to 1,200,000).

After doing some profiling, it seems that the majority of the QTime
comes from dealing with phrases and the resulting term positions,
since most of the search terms are actually multi-word tokenized
phrases. (Processing is dominated by ExactPhraseScorer on down,
particularly SegmentTermPositions and readVInt.)

I have thought of a few ways to improve performance for this use case,
and am looking for feedback on which seems best, as well as any
insight into other approaches to this problem that I haven't
considered (or things to look into to help understand the slow
QTimes more fully):

1) Shard the index - since there is no key that would route a given
query to a particular shard, this would only be of benefit if scoring
is done in parallel. Is there documentation I have so far missed that
describes distributed searching for this case? (I haven't found
anything that really describes the differences in scoring for
distributed vs. non-distributed indices, aside from the warnings that
distributed IDF doesn't work - which I don't think we really care
about.)

2) Implement "Common Grams" as described here:
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
It's not clear how many of the individual words in these phrases are,
in fact, common, but given that 25-50% of the documents in the index
match many queries, this seems like it could be of value.

3) Try to make mm (minimum "should" clauses that must match) work for
the custom query. I haven't been able to figure out exactly how this
parameter works, but my thinking is along the lines of "if only 2 of
those 200 terms match a document, it doesn't need to be scored". What
I don't currently understand is at what point failing the mm
requirement short-circuits - e.g. does the doc still get scored? If it
does short-circuit prior to scoring, this may help somewhat, although
it's not clear it would prevent the many, many reads against term
positions that are still killing QTime.
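For anyone wanting to experiment with mm before wiring it into a
custom component, dismax exposes it as a plain request parameter; a
minimal solrconfig.xml sketch (the handler name here is hypothetical):

```xml
<!-- Hypothetical handler: require at least 2 of the OR'd terms to match
     before a document is considered a hit -->
<requestHandler name="/mlt-custom" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="mm">2</str>
  </lst>
</requestHandler>
```

Inside a custom QueryComponent, the rough Lucene-level equivalent
should be BooleanQuery.setMinimumNumberShouldMatch() - though, as
noted above, whether that avoids the position reads for phrase clauses
is exactly the open question.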

4) Use a dynamic number of terms (rather than the currently fixed 200)
based on the custom boosting/weighting value - e.g. only use terms
whose calculated value is above some threshold. I'm not keen on this,
since some documents may be dominated by many weak terms without any
great ones, so it might break for those (finding the "sweet spot"
cutoff would not be straightforward).

5) *This is my current favorite*: stop tokenizing/analyzing these
terms and just use KeywordTokenizer. Most of these phrases are
pre-vetted, and it may be possible to clean/process the rest before
creating the docs. My main worry is that, currently (if I understand
correctly), a document with the phrase "brazilian pop" would still be
returned as a match for a seed document containing only the phrase
"brazilian" (not the other way around, but that is not necessary);
with KeywordTokenizer, this would no longer be the case. If I switched
from the current dubious tokenize/stem/etc. chain and just used
KeywordTokenizer, would queries like "this used to be a long phrase
query" match documents that have "this used to be a long phrase query"
as one of the multivalued values in the field without having to pull
term positions (and thus significantly speed up performance)?

Thanks,
     Aaron
