Matthew Shapiro [[email protected]] wrote:
> Sorry, I should clarify our current statistics. First of all I meant 183k
> documents (not 183, woops). Around 100k of those are full fledged html
> articles (not web pages but articles in our CMS with html content inside
> of them),

If an article is around 10-30 pages (or the equivalent), this is still a small
corpus.

> the rest of the data are more like key/value data records with a lot
> of attached meta data for searching.

If the number of unique categories (model, author, playtime, lix,
favorite_band, year...) in the meta data is in the low hundreds, you should
be fine.

> Also, what I meant by search without a search term is that probably 80%
> (hard to confirm due to the lack of stats given by the GSA) of our searches
> are done on pure metadata clauses without any searching through the content
> itself,

That clarifies a lot, thanks. So we have roughly 4000*5 queries/day
~= 14 queries/minute. Guessing wildly that your peak time traffic is about
5 times that, we end up with about 1 query/second. That is a very light load
for the Solr installation we're discussing.

> so for example "give me documents that have a content type of
> video, that are marked for client X, have a category of Y or Z, and was
> published to platform A, ordered by date published".

That is a near-trivial query and you should get a reply very fast on modest
hardware.

> The searches that use a search term are more like use the same query from the
> example as before, but find me all the documents that have the string "My
> Video" in it's title and description.

Unless you experiment with fuzzy matches and phrase slop, this should also be
fast. Ignoring analyzers, there is practically no difference between a meta
data field and a larger content field in Solr.

Your current search (guessing here) iterates all terms in the content fields
and takes a comparatively large penalty when a large document is encountered.
Solr uses an inverted index instead: the search terms are looked up in a
dictionary, and each dictionary entry refers to the documents that contain
that term. In an inverted index, the penalty for having thousands or millions
of terms in a field, as compared to tens or hundreds, is very small.
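
To make the inverted-index point concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions (whitespace tokenizing, lowercasing, AND-only clauses), not how Lucene or Solr is actually implemented. The names build_index and search and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index: (field, term) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            # Naive analysis: lowercase and split on whitespace.
            for term in value.lower().split():
                index[(field, term)].add(doc_id)
    return index


def search(index, clauses):
    """AND together (field, term) clauses; each clause is one dictionary lookup."""
    postings = [index.get((field, term.lower()), set()) for field, term in clauses]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical documents: short metadata fields plus one very large text field.
docs = {
    1: {"content_type": "video", "client": "X", "title": "My Video about Solr"},
    2: {"content_type": "article", "client": "X",
        "title": "Long article " + "filler " * 10000},
    3: {"content_type": "video", "client": "Y", "title": "Another video"},
}

index = build_index(docs)

# Pure metadata query: content type "video" AND client "X".
metadata_hits = search(index, [("content_type", "video"), ("client", "X")])

# Same kind of query plus search terms in the title field.
term_hits = search(index, [("content_type", "video"),
                           ("title", "my"), ("title", "video")])

print(metadata_hits, term_hits)
```

Both queries do the same kind of work: one dictionary lookup per clause, then a set intersection. The 10,000-word document costs more at indexing time, but at query time it is never scanned, which is why a metadata clause and a content clause behave almost the same.
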

We're still in "any random machine you've got available"-land so I second
Michael's suggestion.

Regards,
Toke Eskildsen
