You could run the Lucene benchmark module and compare. Or look at ActionGenerator from Sematext on GitHub, which you could also use for performance testing and comparison.
Otis
Solr & ElasticSearch Support
http://sematext.com/

On Feb 14, 2013 10:56 AM, "Michael Della Bitta" <michael.della.bi...@appinions.com> wrote:

> Or perhaps we should develop our own, Solr-based benchmark...
>
> Michael Della Bitta
>
> ------------------------------------------------
> Appinions
> 18 East 41st Street, 2nd Floor
> New York, NY 10017-6271
>
> www.appinions.com
>
> Where Influence Isn’t a Game
>
>
> On Thu, Feb 14, 2013 at 10:54 AM, Michael Della Bitta
> <michael.della.bi...@appinions.com> wrote:
> > My dual-core, HT-enabled Dell Latitude from last year has this CPU:
> > model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> > bogomips : 4988.65
> >
> > An m3.xlarge reports:
> > model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
> > bogomips : 4000.14
> >
> > I tried running geekbench and phoronix-test-suite and failed at both...
> > Anybody have a favorite, free, CLI benchmarking suite?
> >
> > Michael Della Bitta
> >
> > ------------------------------------------------
> > Appinions
> > 18 East 41st Street, 2nd Floor
> > New York, NY 10017-6271
> >
> > www.appinions.com
> >
> > Where Influence Isn’t a Game
> >
> >
> > On Thu, Feb 14, 2013 at 8:10 AM, Jack Krupansky <j...@basetechnology.com> wrote:
> >> That raises the question of how your average professional notebook computer
> >> (PC or Mac or Linux) compares to a garden-variety cloud server such as an
> >> Amazon EC2 m1.large (or m3.xlarge) in terms of performance, such as document
> >> ingestion rate or how many documents you can load before load and/or query
> >> performance starts to fall off the cliff. Anybody have any numbers? I mean,
> >> is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel?
> >> (With all the usual caveats that "it all depends" and "your mileage will
> >> vary.") But the intent would be for a similar workload on both (like loading
> >> the Wikipedia dump.)
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message-----
> >> From: Erick Erickson
> >> Sent: Thursday, February 14, 2013 7:31 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: What should focus be on hardware for solr servers?
> >>
> >> One data point: I can comfortably index and search the Wikipedia dump (11M
> >> articles, 5M with text) on my MacBook Pro. Admittedly not heavy-duty
> >> queries, but....
> >>
> >> Erick
> >>
> >>
> >> On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro <m...@mshapiro.net> wrote:
> >>
> >>> Excellent, thank you very much for the reply!
> >>>
> >>> On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> >>>
> >>> > Matthew Shapiro [m...@mshapiro.net] wrote:
> >>> >
> >>> > > Sorry, I should clarify our current statistics. First of all, I meant
> >>> > > 183k documents (not 183, woops). Around 100k of those are full-fledged
> >>> > > html articles (not web pages but articles in our CMS with html content
> >>> > > inside of them),
> >>> >
> >>> > If an article is around 10-30 pages (or the equivalent), this is still a
> >>> > small corpus.
> >>> >
> >>> > > the rest of the data are more like key/value data records with a lot
> >>> > > of attached meta data for searching.
> >>> >
> >>> > If the amount of unique categories (model, author, playtime, lix,
> >>> > favorite_band, year...) in the meta data is in the lower hundreds, you
> >>> > should be fine.
> >>> >
> >>> > > Also, what I meant by search without a search term is that probably > 80%
> >>> > > (hard to confirm due to the lack of stats given by the GSA) of our
> >>> > > searches are done on pure metadata clauses without any searching through
> >>> > > the content itself,
> >>> >
> >>> > That clarifies a lot, thanks. So we have roughly speaking 4000*5
> >>> > queries/day ~= 14 queries/minute. Guessing wildly that your peak time
> >>> > traffic is about 5 times that, we end up with about 1 query/second. That
> >>> > is a very light load for the Solr installation we're discussing.
> >>> >
> >>> > > so for example "give me documents that have a content type of video,
> >>> > > that are marked for client X, have a category of Y or Z, and was
> >>> > > published to platform A, ordered by date published".
> >>> >
> >>> > That is a near-trivial query and you should get a reply very fast on
> >>> > modest hardware.
> >>> >
> >>> > > The searches that use a search term are more like: use the same query
> >>> > > from the example as before, but find me all the documents that have
> >>> > > the string "My Video" in its title and description.
> >>> >
> >>> > Unless you experiment with fuzzy matches and phrase slop, this should
> >>> > also be fast. Ignoring analyzers, there is practically no difference
> >>> > between a meta data field and a larger content field in Solr.
> >>> >
> >>> > Your current search (guessing here) iterates all terms in the content
> >>> > fields and takes a comparatively large penalty when a large document is
> >>> > encountered. With the inverted index in Solr, the search terms are
> >>> > looked up in a dictionary that points to the documents they belong to.
> >>> > The penalty for having thousands or millions of terms in a field, as
> >>> > compared to tens or hundreds, is very small in an inverted index.
> >>> >
> >>> > We're still in "any random machine you've got available"-land, so I
> >>> > second Michael's suggestion.
> >>> >
> >>> > Regards,
> >>> > Toke Eskildsen
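
To make Toke's point concrete: the metadata-only request Matthew describes maps directly onto Solr filter queries. Below is a minimal sketch in Python (standard library only), where the field names (content_type, client, category, platform, published_date), the core name "cms", and the host are assumptions for illustration, not anything taken from Matthew's actual schema.

    # Metadata-only search: no free-text term, just filter queries and a sort.
    # Field names, core name and host below are hypothetical placeholders.
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode([
        ("q", "*:*"),                      # match everything; the filters do the work
        ("fq", "content_type:video"),      # content type of video
        ("fq", "client:X"),                # marked for client X
        ("fq", "category:(Y OR Z)"),       # category of Y or Z
        ("fq", "platform:A"),              # published to platform A
        ("sort", "published_date desc"),   # ordered by date published
        ("wt", "json"),
    ])

    url = "http://localhost:8983/solr/cms/select?" + params
    with urllib.request.urlopen(url) as response:
        print(json.load(response)["response"]["numFound"])

The "My Video" variant would keep the same fq parameters and only change q, e.g. q=title:"My Video" OR description:"My Video"; as Toke notes, against an inverted index that costs about the same.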
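
As for the laptop-versus-EC2 comparison, the model name and bogomips figures Michael quotes come straight from /proc/cpuinfo, so they can be collected on any Linux box without installing a benchmarking suite. A rough sketch (Python, standard library; Linux-only, and bogomips is only a very crude indicator, not a real benchmark):

    # Print CPU model, bogomips and logical CPU count as reported by the kernel.
    # Linux-only; the 'bogomips' key casing can vary by architecture.
    def cpu_summary(path="/proc/cpuinfo"):
        model, bogomips, threads = None, None, 0
        with open(path) as f:
            for line in f:
                if ":" not in line:
                    continue
                key, value = (part.strip() for part in line.split(":", 1))
                if key == "model name":
                    model = value
                    threads += 1            # one 'model name' line per logical CPU
                elif key.lower() == "bogomips":
                    bogomips = value
        return model, bogomips, threads

    if __name__ == "__main__":
        model, bogomips, threads = cpu_summary()
        print("model name :", model)
        print("bogomips   :", bogomips)
        print("threads    :", threads)

For an apples-to-apples indexing and query comparison, though, a workload-level test (the Lucene benchmark module or ActionGenerator mentioned above, run against the same corpus on both machines) will say far more than any single CPU figure.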