Thanks Erick, your insights are really useful.
Honestly I agree, and when the time comes to prototype I will definitely
proceed as you suggested!

Cheers

2015-09-23 16:53 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:

> Sure, prototyping is the best answer ;).
>
> Well, the biggest question is "how long
> does each shard spend on the query"?
>
> Adding more shards will likely decrease
> the time each shard takes to process the
> first-pass query. But if your base is
> 50ms and sharding further only takes it to
> 40ms, I don't think it's worth it.
>
> I don't think it's worth it b/c a small savings
> is likely dwarfed by second-pass processing.
> Here's the scenario; say you're returning 10 rows.
>
> > node1 receives the original request.
> > node1 forwards sub-requests to one replica of each shard.
> > node1 receives the top 10 results from each replica
>    but all that's returned is the doc ID and sort criteria.
> > node1 sorts all those lists and determines the true top 10
> > node1 then issues a query like "q=id:(1 5 33)" to each
>    replica to get the docs in the top 10 that came from that
>    replica on the first pass.
> > node1 then assembles the true top 10 into a list to return
>    to the client.
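>
> Just to make that concrete, here's a rough SolrJ sketch of the shape of
> those two passes against a single shard. The host, core and field names
> are made up, and Solr does all of this internally, so this is only to
> illustrate the pattern, not something you'd normally write yourself:
>
>   import org.apache.solr.client.solrj.SolrQuery;
>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>   import org.apache.solr.client.solrj.response.QueryResponse;
>   import org.apache.solr.common.SolrDocument;
>
>   public class TwoPassSketch {
>     public static void main(String[] args) throws Exception {
>       // Hypothetical single-shard core; in reality node1 fans out to every shard.
>       HttpSolrClient shard = new HttpSolrClient.Builder(
>           "http://localhost:8983/solr/collection1_shard1_replica1").build();
>
>       // First pass: only the doc ID and sort criteria (score here) come back.
>       SolrQuery firstPass = new SolrQuery("some user query");
>       firstPass.set("distrib", "false");   // keep the query on this core only
>       firstPass.setFields("id", "score");
>       firstPass.setRows(10);
>       QueryResponse perShardTop10 = shard.query(firstPass);
>       System.out.println("first pass ids: " + perShardTop10.getResults().size());
>
>       // ... node1 merges the per-shard lists and picks the true top 10 ...
>
>       // Second pass: fetch the full documents for the winners from this shard.
>       SolrQuery secondPass = new SolrQuery("id:(1 5 33)");
>       secondPass.set("distrib", "false");
>       QueryResponse winners = shard.query(secondPass);
>       for (SolrDocument d : winners.getResults()) {
>         System.out.println(d);
>       }
>       shard.close();
>     }
>   }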
>
> And, as you add more shards, you encounter the "laggard problem":
> the more shards you query, the more likely it is that one of them
> is in the middle of a garbage collection (or anything else that
> slows its processing down), and that one slow node slows down the
> entire query.
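>
> To put a rough number on it: if each shard independently had, say, a 1%
> chance of being mid-GC when the sub-request arrives (a made-up figure),
> then with 16 shards about 1 - 0.99^16, roughly 15%, of queries would hit
> at least one slow shard, versus roughly 4% with 4 shards.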
>
> The fastest way I can think of to prototype would be to set up
> your load tester with realistic queries (or, indeed, queries from
> your logs if you have them) and append &distrib=false to them
> all (or just use a single non-sharded machine).
> Then start out with, say, 10M docs on a test machine and run
> a load test. Add 10M more, test again, and so on until you have good
> numbers.
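>
> Something like this (SolrJ again; the core name, queries and doc counts
> are just placeholders) is enough to replay queries against a single core
> with distrib=false and record QTime per request:
>
>   import java.util.List;
>   import org.apache.solr.client.solrj.SolrQuery;
>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>
>   public class SingleShardLoadTest {
>     public static void main(String[] args) throws Exception {
>       HttpSolrClient core = new HttpSolrClient.Builder(
>           "http://localhost:8983/solr/prototype_core").build();
>
>       // Realistic queries, ideally pulled straight from your production logs.
>       List<String> queries = List.of("title:foo", "body:bar AND type:baz");
>
>       long totalQTime = 0;
>       for (String q : queries) {
>         SolrQuery query = new SolrQuery(q);
>         query.set("distrib", "false");   // don't distribute, hit this core only
>         totalQTime += core.query(query).getQTime();
>       }
>       System.out.println("avg QTime ms: " + totalQTime / queries.size());
>       core.close();
>     }
>   }
>
> Re-run it after each 10M-doc increment and compare the averages.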
>
> Frankly, though, my suspicion is you'll be fine with 25M docs/shard.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:08 AM, Alessandro Benedetti <
> benedetti.ale...@gmail.com> wrote:
>
> > It's quite common to hear about the benefits of sharding.
> > Until we hit the I/O bound on our machines, sharding is likely to reduce
> > the query time.
> > Furthermore, working on smaller indexes will make the individual searches
> > on the smaller nodes faster.
> > But what about the other way around?
> > What if we actually shard much more than needed?
> > Are we also going to see an increase in query time (due to the
> > overhead of query distribution and aggregation of results)?
> >
> > Example: 100,000,000 docs across 16 shards.
> > Wouldn't it be more effective to have 4 or 8 shards at most?
> > I suspect that prototyping is the right answer, but do we have any
> > strongly motivated general suggestions?
> > Any resources to study?
> >
> > Cheers
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card - http://about.me/alessandro_benedetti
> > Blog - http://alexbenedetti.blogspot.co.uk
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
