Re: What should focus be on hardware for solr servers?

Erick Erickson Thu, 14 Feb 2013 04:31:39 -0800

One data point: I can comfortably index and search the Wikipedia dump (11M
articles, 5M with text) on my MacBook Pro. Admittedly not with heavy-duty
queries, but...


Erick


On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro <m...@mshapiro.net> wrote:

> Excellent, thank you very much for the reply!
>
> On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>
> > Matthew Shapiro [m...@mshapiro.net] wrote:
> >
> > > Sorry, I should clarify our current statistics.  First of all, I meant
> > > 183k documents (not 183, whoops). Around 100k of those are full-fledged
> > > HTML articles (not web pages but articles in our CMS with HTML content
> > > inside of them),
> >
> > If an article is around 10-30 pages (or the equivalent), this is still a
> > small corpus.
> >
> > > the rest of the data are more like key/value data records with a lot
> > > of attached metadata for searching.
> >
> > If the number of unique categories (model, author, playtime, lix,
> > favorite_band, year...) in the metadata is in the low hundreds, you
> > should be fine.
> >
> > > Also, what I meant by search without a search term is that probably 80%
> > > (hard to confirm due to the lack of stats given by the GSA) of our
> > > searches are done on pure metadata clauses without any searching through
> > > the content itself,
> >
> > That clarifies a lot, thanks. So we have roughly speaking 4000*5
> > queries/day ~= 14 queries/minute. Guessing wildly that your peak time
> > traffic is about 5 times that, we end up with about 1 query/second. That
> > is a very light load for the Solr installation we're discussing.
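
The back-of-the-envelope load estimate above can be reproduced with a few
lines of Python. This is only a sketch of the arithmetic; the inputs are the
figures quoted in the thread, and the peak multiplier is the same wild guess.

```python
# Inputs taken from the figures quoted in the thread.
QUERIES_PER_DAY = 4000 * 5
PEAK_FACTOR = 5  # the "guessing wildly" peak multiplier, an assumption

MINUTES_PER_DAY = 24 * 60

# Average load, assuming queries are spread evenly over the whole day.
avg_per_minute = QUERIES_PER_DAY / MINUTES_PER_DAY

# Peak load, converted from queries/minute to queries/second.
peak_per_second = avg_per_minute * PEAK_FACTOR / 60

print(f"average: {avg_per_minute:.1f} queries/minute")
print(f"peak:    {peak_per_second:.2f} queries/second")
```

Swapping in measured numbers, once they are available, only requires changing
the two constants at the top.
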
> >
> > > so for example "give me documents that have a content type of
> > > video, that are marked for client X, have a category of Y or Z, and were
> > > published to platform A, ordered by date published".
> >
> > That is a near-trivial query and you should get a reply very fast on
> > modest hardware.
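
As a rough illustration, a metadata-only query like the one described above
could be expressed with Solr's standard q, fq, sort and rows parameters. The
field names (content_type, client, category, platform, date_published) and
the /select path are hypothetical; the real ones depend on your schema and
request handler setup.

```python
from urllib.parse import urlencode

# Hypothetical field names and values; substitute the ones from your schema.
params = [
    ("q", "*:*"),                     # no search term: match all documents
    ("fq", "content_type:video"),     # each fq is a separate filter clause
    ("fq", "client:X"),
    ("fq", "category:(Y OR Z)"),
    ("fq", "platform:A"),
    ("sort", "date_published desc"),  # newest first
    ("rows", "10"),
]

# urlencode accepts a list of pairs, so the repeated fq key is preserved.
query_string = urlencode(params)
print("/select?" + query_string)
```

Keeping each restriction in its own fq clause, rather than folding everything
into q, lets Solr cache the filters independently, which tends to help when
the same metadata clauses recur across queries.
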
> >
> > > The searches that use a search term are more like: use the same query
> > > from the example before, but find me all the documents that have the
> > > string "My Video" in their title and description.
> >
> > Unless you experiment with fuzzy matches and phrase slop, this should
> > also be fast. Ignoring analyzers, there is practically no difference
> > between a metadata field and a larger content field in Solr.
> >
> > Your current search (guessing here) iterates over all terms in the content
> > fields and takes a comparatively large penalty when a large document is
> > encountered. The inverted index in Solr means that the search terms are
> > looked up in a dictionary whose entries refer to the documents containing
> > them. The penalty for having thousands or millions of terms, as compared to
> > tens or hundreds, in a field in an inverted index is very small.
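
The inverted-index idea described above can be sketched as a toy in Python.
This is not how Lucene/Solr is implemented internally; it only illustrates
why lookup cost depends on the query terms rather than on document length.
The function names and sample documents are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return ids of documents containing all terms (one lookup per term)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Made-up documents: document 2 is deliberately very long.
docs = {
    1: "my video about solr hardware",
    2: "a very long article " + "filler " * 10000 + "video",
    3: "key value record",
}
index = build_inverted_index(docs)

print(sorted(search(index, "video")))
print(sorted(search(index, "my", "video")))
```

The long document costs more at indexing time, but at query time the search
for "video" is a single dictionary lookup no matter how long any of the
documents are.
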
> >
> > We're still in "any random machine you've got available"-land so I second
> > Michael's suggestion.
> >
> > Regards,
> > Toke Eskildsen
>
