Thanks Erick, your insights are really useful. Honestly I agree and when the prototyping will come I will definitely proceed like you suggested !
Cheers 2015-09-23 16:53 GMT+01:00 Erick Erickson <erickerick...@gmail.com>: > Sure, prototyping is the best answer ;). > > Well, the biggest question is "how long > does each shard spend on the query"? > > Adding more shards will likely decrease > the time each shard takes to process the > first-pass query. But if your base is > 50ms, and sharding some more takes it to > 40ms I don't think it's worth it. > > I don't think it's worth it b/c a small savings > is likely dwarfed by second-pass processing. > Here's the scenario, say you're returning 10 rows. > > > node1 receives the original request. > > node1 forwards sub-requests to one replica of each shard. > > node1 receives the top 10 results from each replica > but all that's returned is the doc ID and sort criteria. > > node1 sorts all those lists and determines the true top 10 > > node1 then issues a query like "q=id:(1 5 33)" to each > replica to get the docs in the top 10 that came from that > replica on the first pass. > > node1 then assembles the true top 10 into a list to return > to the client. > > And, as you add more shards, you encounter the "laggard problem", > the likelihood that you hit one node during garbage collection or > anything else that would cause the processing to be slower > will slow down the entire query. > > The fastest way I can think of to prototype would be to set up > your load tester with realistic queries (or, indeed, queries from > your logs if you have them) and append &distrib=false to them > all (or just use a single non-sharded machine) > Then start out with, say, 10M docs on a test machine, run > a load test. Add 10M more and test and so on until you have good > numbers. > > Frankly, though, my suspicion is you'll be fine with 25M docs/shard.. > > Best, > Erick > > On Wed, Sep 23, 2015 at 7:08 AM, Alessandro Benedetti < > benedetti.ale...@gmail.com> wrote: > > > It's quite common to hear about the benefit of sharding. > > Until we reach the I/O bound on our machines, sharding is likely to > reduce > > the query time. > > Furthermore working on smaller indexes will make the single searches > faster > > on the smaller nodes. > > But what about the other way around ? > > What if we actually shard much more than needed ? > > Are we going to see also an increase in the query time ( due to the > > overhead of query distribution and aggregation of results ? ) > > > > Example : 100.000.000 docs across 16 shards . > > Wouldn't be more effective to have 4 or 8 shards maximum ? > > I suspect that prototyping is the right answer, but do we have any > general > > suggestions strongly motivated ? > > Any resource to study ? > > > > Cheers > > > > -- > > -------------------------- > > > > Benedetti Alessandro > > Visiting card - http://about.me/alessandro_benedetti > > Blog - http://alexbenedetti.blogspot.co.uk > > > > "Tyger, tyger burning bright > > In the forests of the night, > > What immortal hand or eye > > Could frame thy fearful symmetry?" > > > > William Blake - Songs of Experience -1794 England > > > -- -------------------------- Benedetti Alessandro Visiting card - http://about.me/alessandro_benedetti Blog - http://alexbenedetti.blogspot.co.uk "Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry?" William Blake - Songs of Experience -1794 England