One data point: I can comfortably index and search the Wikipedia dump (11M articles, 5M with text) on my MacBook Pro. Admittedly not with heavy-duty queries, but...
Erick

On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro <m...@mshapiro.net> wrote:

> Excellent, thank you very much for the reply!
>
> On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>
> > Matthew Shapiro [m...@mshapiro.net] wrote:
> >
> > > Sorry, I should clarify our current statistics. First of all I meant 183k
> > > documents (not 183, whoops). Around 100k of those are full fledged html
> > > articles (not web pages but articles in our CMS with html content inside
> > > of them),
> >
> > If an article is around 10-30 pages (or the equivalent), this is still a
> > small corpus.
> >
> > > the rest of the data are more like key/value data records with a lot
> > > of attached meta data for searching.
> >
> > If the number of unique categories (model, author, playtime, lix,
> > favorite_band, year...) in the meta data is in the lower hundreds, you
> > should be fine.
> >
> > > Also, what I meant by search without a search term is that probably 80%
> > > (hard to confirm due to the lack of stats given by the GSA) of our searches
> > > are done on pure metadata clauses without any searching through the content
> > > itself,
> >
> > That clarifies a lot, thanks. So we have roughly speaking 4000*5
> > queries/day ~= 14 queries/minute. Guessing wildly that your peak time
> > traffic is about 5 times that, we end up with about 1 query/second. That is
> > a very light load for the Solr installation we're discussing.
> >
> > > so for example "give me documents that have a content type of
> > > video, that are marked for client X, have a category of Y or Z, and were
> > > published to platform A, ordered by date published".
> >
> > That is a near-trivial query and you should get a reply very fast on
> > modest hardware.
> >
> > > The searches that use a search term are more like use the same query from the
> > > example as before, but find me all the documents that have the string "My Video"
> > > in its title and description.
> >
> > Unless you experiment with fuzzy matches and phrase slop, this should also
> > be fast. Ignoring analyzers, there is practically no difference between a
> > meta data field and a larger content field in Solr.
> >
> > Your current search (guessing here) iterates all terms in the content
> > fields and takes a comparatively large penalty when a large document is
> > encountered. The inversion of index in Solr means that the search terms are
> > looked up in a dictionary and refer to the documents they belong to. The
> > penalty for having thousands or millions of terms as compared to tens or
> > hundreds in a field in an inverted index is very small.
> >
> > We're still in "any random machine you've got available"-land so I second
> > Michael's suggestion.
> >
> > Regards,
> > Toke Eskildsen
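
To illustrate the point above about the inverted index, here is a minimal Python sketch. It is not how Solr or Lucene is actually implemented; the function names (build_inverted_index, search_all_terms) and the sample documents are made up for illustration, and tokenizing is plain whitespace splitting with no analyzers. It only shows that a search looks terms up in a dictionary instead of scanning each document, so one very large document adds little to the lookup cost.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One dictionary lookup per query term; document length does not matter here.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample data: one short title, one very long article, one record.
docs = {
    1: "My Video about Solr",
    2: "A long article " + "filler " * 10000 + "video",
    3: "Unrelated metadata record",
}

index = build_inverted_index(docs)
hits = search_all_terms(index, "my video")
print(hits)
```

The long document costs more at index time, but at query time the search touches only the postings for "my" and "video", which is the small penalty described in the quoted text.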