After rebuilding my index over the weekend with termVectors enabled for the
relevant fields, I've run some basic testing against the MoreLikeThis
handler with these settings from the Solr wiki: {boost=true, mindf=1,
mintf=1}.
My index contains around 20M documents, averaging under 1K of content with
some outliers as large as 4-8K. The total index size on disk is 29G. The
two largest fields are not involved in the searches as they're fairly
"noisy" when it comes to similarity. There is a filter query enabled based
on a document status field, which cuts the number of documents under
consideration approximately in half (11M will pass it).
I've got the MLT handler searching against three fields right now, each of
which typically has 3-10 words, and when searching for 10 matches I'm seeing
results typically in the 700ms to 1.4 seconds range on the first run for a
given document ID. Given that our main use case involves random access, I'm
curious whether it's normal to see results in this range for a query like
this on an index this size. The index is optimized and was not being
written to at the time of testing.
I've tried limiting the fields returned to just the ID and the results are
similar, so I don't think this is related to stored fields. I tried
increasing the mintf, but since I'm using small fields that pretty much
resulted in no usable terms being extracted on most documents and thus no
results. Increasing the maxqt seems like it may help sometimes, but not all
the time and at the cost of visibly less relevant results.
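For reference, the kind of request described above can be sketched roughly as follows. This is only an illustration: the handler path (/mlt), the field names (title, summary, tags), the unique key name (id) and the status filter value are all assumptions, not the actual configuration. The mlt.* parameter names are the standard MoreLikeThis ones.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, fields, rows=10):
    """Build a MoreLikeThis request URL for one source document.

    Assumes a MoreLikeThis handler registered at /mlt and a unique key
    field called "id"; both are illustrative guesses.
    """
    params = {
        "q": f"id:{doc_id}",          # source document to find matches for
        "mlt.fl": ",".join(fields),   # fields to extract "interesting" terms from
        "mlt.boost": "true",          # boost query terms by their relevance
        "mlt.mindf": 1,               # minimum document frequency for a term
        "mlt.mintf": 1,               # minimum term frequency in the source doc
        "fq": "status:active",        # hypothetical document status filter
        "fl": "id",                   # return only the ID, no other stored fields
        "rows": rows,                 # number of similar documents wanted
    }
    return f"{base_url}/mlt?{urlencode(params)}"


url = build_mlt_url("http://localhost:8983/solr", 12345,
                    ["title", "summary", "tags"])
print(url)
```

Adding debugQuery=true to the same request is what exposes the per-term timing and match information mentioned below.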
Turning on the debug information, it looks like most of the time is spent on
terms that are very common, some of which match hundreds of thousands of
documents. Is the query time just a natural extrapolation of scoring the
large number of documents?

In general, yes.
When you have a subset of terms that occur frequently, there are some
common techniques to avoid taking big performance hits:
* Remove the terms. Easy, but you can no longer search on them,
either as part of an OR query or in a phrase.
* Combine common terms with following terms. This works well, but is
a bit more complex and can significantly grow the size of your index.
Either approach requires the kind of data analysis you're already doing,
in order to generate the target set of common terms.
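As a rough illustration of both options, here is a small sketch in Python. It is not how Solr or Lucene implement this internally; the function names, the underscore joiner and the 50% document-frequency cutoff are all made up for the example. It picks the common terms by document frequency, then shows what each technique does to a token stream.

```python
from collections import Counter


def common_terms(docs, max_df_ratio):
    """Return terms that occur in more than max_df_ratio of the documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    limit = max_df_ratio * len(docs)
    return {term for term, count in df.items() if count > limit}


def remove_common(tokens, common):
    """Option 1: drop the common terms entirely (they become unsearchable)."""
    return [t for t in tokens if t not in common]


def combine_common(tokens, common):
    """Option 2: replace each common term with a combined term made of
    that term plus the one that follows it. Rarer combined terms match
    far fewer documents, at the cost of a larger term dictionary."""
    out = []
    for i, tok in enumerate(tokens):
        if tok in common and i + 1 < len(tokens):
            out.append(f"{tok}_{tokens[i + 1]}")
        else:
            out.append(tok)  # uncommon term, or a common term at the very end
    return out


docs = [
    ["the", "red", "fox"],
    ["the", "blue", "sky"],
    ["the", "red", "sky"],
    ["a", "fox"],
]
common = common_terms(docs, max_df_ratio=0.5)
removed = remove_common(docs[0], common)
combined = combine_common(docs[0], common)
print(common, removed, combined)
```

In a real index the document frequencies would come from the index's own term statistics rather than from re-tokenizing the documents, and the same transformation has to be applied at both index time and query time.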
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"