Excellent, thank you very much for the reply!

On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> Matthew Shapiro [m...@mshapiro.net] wrote:
>
> > Sorry, I should clarify our current statistics.  First of all I meant
> > 183k documents (not 183, whoops). Around 100k of those are full-fledged
> > HTML articles (not web pages but articles in our CMS with HTML content
> > inside of them),
>
> If an article is around 10-30 pages (or the equivalent), this is still a
> small corpus.
>
> > the rest of the data are more like key/value data records with a lot
> > of attached meta data for searching.
>
> If the number of unique categories (model, author, playtime, lix,
> favorite_band, year...) in the meta data is in the lower hundreds, you
> should be fine.
>
> > Also, what I meant by search without a search term is that probably 80%
> > (hard to confirm due to the lack of stats given by the GSA) of our
> > searches are done on pure metadata clauses without any searching through
> > the content itself,
>
> That clarifies a lot, thanks. So we have roughly speaking 4000*5
> queries/day ~= 14 queries/minute. Guessing wildly that your peak time
> traffic is about 5 times that, we end up with about 1 query/second. That is
> a very light load for the Solr installation we're discussing.
>
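
To double-check the arithmetic above, here is a quick sketch in Python. The 4000*5 searches/day and the peak factor of 5 are the figures quoted above; the function and variable names are only mine for illustration.

```python
# Back-of-the-envelope load estimate using the figures quoted above.
SEARCHES_PER_DAY = 4000 * 5   # figure taken from the quoted estimate
PEAK_FACTOR = 5               # the "guessing wildly" peak multiplier


def queries_per_minute(per_day):
    """Average queries per minute, spread evenly over 24 hours."""
    return per_day / (24 * 60)


def peak_queries_per_second(per_day, peak_factor):
    """Queries per second when traffic is peak_factor times the average."""
    return queries_per_minute(per_day) * peak_factor / 60


avg_qpm = queries_per_minute(SEARCHES_PER_DAY)
peak_qps = peak_queries_per_second(SEARCHES_PER_DAY, PEAK_FACTOR)

print(f"{avg_qpm:.1f} queries/minute on average")   # 13.9
print(f"{peak_qps:.2f} queries/second at peak")     # 1.16
```

So the "about 14 queries/minute" and "about 1 query/second" figures hold up.
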
> > so for example "give me documents that have a content type of
> > video, that are marked for client X, have a category of Y or Z, and were
> > published to platform A, ordered by date published".
>
> That is a near-trivial query and you should get a reply very fast on
> modest hardware.
>
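
For my own notes, this is roughly how I would expect that metadata-only query to map onto Solr request parameters. It is only a sketch: q, fq and sort are standard Solr query parameters, but the field names (content_type, client, category, platform, date_published) and the /select path are assumptions on my part and not our real schema.

```python
from urllib.parse import urlencode

# Match every document, then narrow down with filter queries (fq).
# All field names below are hypothetical placeholders for our schema.
params = [
    ("q", "*:*"),                     # no search term: match all documents
    ("fq", "content_type:video"),     # content type of video
    ("fq", "client:X"),               # marked for client X
    ("fq", "category:(Y OR Z)"),      # category of Y or Z
    ("fq", "platform:A"),             # published to platform A
    ("sort", "date_published desc"),  # newest first (direction assumed)
]

query_string = urlencode(params)
print("/select?" + query_string)
```

Each clause is a separate fq so that none of them affects scoring; the only ordering comes from the sort parameter.
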
> > The searches that use a search term are more like: use the same query
> > as in the example before, but find me all the documents that have the
> > string "My Video" in their title and description.
>
> Unless you experiment with fuzzy matches and phrase slop, this should also
> be fast. Ignoring analyzers, there is practically no difference between a
> meta data field and a larger content field in Solr.
>
> Your current search (guessing here) iterates all terms in the content
> fields and takes a comparatively large penalty when a large document is
> encountered. The inverted index in Solr means that the search terms are
> looked up in a dictionary that refers to the documents they occur in. The
> penalty for having thousands or millions of terms as compared to tens or
> hundreds in a field in an inverted index is very small.
>
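
To convince myself of the inverted index point, I put together a toy version in Python. This is a deliberately simplified sketch and not how Solr/Lucene is implemented: it splits on whitespace, lowercases, and only does an AND match on terms (no phrase matching, no analyzers, no scoring). All names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One dictionary lookup per query term, regardless of document length.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "My Video about cats",
    2: "a long article " + "filler " * 1000 + "video",
    3: "key value record",
}
index = build_inverted_index(docs)

print(sorted(search(index, "my video")))  # [1]
print(sorted(search(index, "video")))     # [1, 2]
```

The lookup cost depends on the number of query terms and the size of the matching document sets, not on how long document 2 is, which I take to be the point being made above.
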
> We're still in "any random machine you've got available"-land so I second
> Michael's suggestion.
>
> Regards,
> Toke Eskildsen
