Excellent, thank you very much for the reply!

On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> Matthew Shapiro [m...@mshapiro.net] wrote:
>
> > Sorry, I should clarify our current statistics. First of all I meant 183k
> > documents (not 183, woops). Around 100k of those are full-fledged html
> > articles (not web pages but articles in our CMS with html content inside
> > of them),
>
> If an article is around 10-30 pages (or the equivalent), this is still a
> small corpus.
>
> > the rest of the data are more like key/value data records with a lot
> > of attached meta data for searching.
>
> If the number of unique categories (model, author, playtime, lix,
> favorite_band, year...) in the meta data is in the lower hundreds, you
> should be fine.
>
> > Also, what I meant by search without a search term is that probably 80%
> > (hard to confirm due to the lack of stats given by the GSA) of our searches
> > are done on pure metadata clauses without any searching through the content
> > itself,
>
> That clarifies a lot, thanks. So we have roughly speaking 4000*5
> queries/day ~= 14 queries/minute. Guessing wildly that your peak time
> traffic is about 5 times that, we end up with about 1 query/second. That is
> a very light load for the Solr installation we're discussing.
>
> > so for example "give me documents that have a content type of
> > video, that are marked for client X, have a category of Y or Z, and were
> > published to platform A, ordered by date published".
>
> That is a near-trivial query and you should get a reply very fast on
> modest hardware.
>
> > The searches that use a search term are more like use the same query from the
> > example as before, but find me all the documents that have the string "My Video"
> > in its title and description.
>
> Unless you experiment with fuzzy matches and phrase slop, this should also
> be fast. Ignoring analyzers, there is practically no difference between a
> meta data field and a larger content field in Solr.
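As a sanity check on the load estimate quoted above, the arithmetic can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch: the 4000*5 figure and the peak factor of 5 come straight from the thread, while the variable and function names are made up for illustration.

```python
# Back-of-the-envelope load estimate using the figures from the thread.
SEARCHES_PER_DAY = 4000 * 5   # the "4000*5 queries/day" figure quoted above
PEAK_FACTOR = 5               # the "guessing wildly" peak-time multiplier

def queries_per_minute(per_day):
    """Average queries per minute, assuming traffic spread evenly over 24 hours."""
    return per_day / (24 * 60)

avg_qpm = queries_per_minute(SEARCHES_PER_DAY)
peak_qps = avg_qpm * PEAK_FACTOR / 60

print(f"average: {avg_qpm:.1f} queries/minute")   # average: 13.9 queries/minute
print(f"peak:    {peak_qps:.2f} queries/second")  # peak:    1.16 queries/second
```

So the quoted "~14 queries/minute" and "about 1 query/second" both hold up, given the evenly-spread-traffic assumption.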
>
> Your current search (guessing here) iterates all terms in the content
> fields and takes a comparatively large penalty when a large document is
> encountered. The inverted index in Solr means that the search terms are
> looked up in a dictionary, which refers to the documents they belong to. The
> penalty for having thousands or millions of terms as compared to tens or
> hundreds in a field in an inverted index is very small.
>
> We're still in "any random machine you've got available"-land so I second
> Michael's suggestion.
>
> Regards,
> Toke Eskildsen
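To make the inverted index point concrete, here is a toy Python sketch of the idea. It is not how Solr or Lucene is actually implemented: there are no analyzers, no scoring and no phrase matching, and the names build_index, search and the sample documents are all hypothetical. It only shows that a query is answered by dictionary lookups plus set intersection, so the length of a field does not matter at query time.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: (field, term) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index

def search(index, clauses):
    """AND together (field, term) clauses by intersecting their posting sets."""
    postings = [index.get((field, term.lower()), set()) for field, term in clauses]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample data: document 2 has a very long title field.
docs = {
    1: {"content_type": "video", "title": "My Video about Solr"},
    2: {"content_type": "article", "title": "My long article " + "filler " * 10000},
    3: {"content_type": "video", "title": "Another clip"},
}
index = build_index(docs)

# Pure metadata query: one dictionary lookup.
print(sorted(search(index, [("content_type", "video")])))  # [1, 3]

# Metadata clause plus title terms (individual words, not a true phrase match).
print(sorted(search(index, [("content_type", "video"),
                            ("title", "my"), ("title", "video")])))  # [1]
```

The 10,000 filler words in document 2 make indexing slower but add only a single extra dictionary entry, which is the "very small penalty" described above.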