Sure, prototyping is the best answer ;). Well, the biggest question is "how long does each shard spend on the query?"
Adding more shards will likely decrease the time each shard spends on the first-pass query. But if your baseline is 50ms and sharding some more takes it to 40ms, I don't think it's worth it, because a small savings is likely dwarfed by the second-pass processing. Here's the scenario, say you're returning 10 rows:

> node1 receives the original request.
> node1 forwards sub-requests to one replica of each shard.
> node1 receives the top 10 results from each replica, but all that's returned is the doc ID and sort criteria.
> node1 sorts all those lists and determines the true top 10.
> node1 then issues a query like "q=id:(1 5 33)" to each replica to get the docs in the top 10 that came from that replica on the first pass.
> node1 then assembles the true top 10 into a list to return to the client.

And as you add more shards, you encounter the "laggard problem": the likelihood that you hit one node during garbage collection (or anything else that would make processing slower) increases, and the slowest shard slows down the entire query.

The fastest way I can think of to prototype would be to set up your load tester with realistic queries (or, indeed, queries from your logs if you have them) and append &distrib=false to them all (or just use a single non-sharded machine). Then start out with, say, 10M docs on a test machine and run a load test. Add 10M more and test, and so on until you have good numbers.

Frankly, though, my suspicion is you'll be fine with 25M docs/shard.

Best,
Erick

On Wed, Sep 23, 2015 at 7:08 AM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> It's quite common to hear about the benefit of sharding.
> Until we reach the I/O bound on our machines, sharding is likely to reduce
> the query time.
> Furthermore, working on smaller indexes will make the single searches
> faster on the smaller nodes.
> But what about the other way around?
> What if we actually shard much more than needed?
> Are we going to see an increase in the query time as well (due to the
> overhead of query distribution and aggregation of results)?
>
> Example: 100,000,000 docs across 16 shards.
> Wouldn't it be more effective to have 4 or 8 shards maximum?
> I suspect that prototyping is the right answer, but do we have any
> general suggestions strongly motivated?
> Any resources to study?
>
> Cheers
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
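[Editor's note: the two-pass flow Erick describes can be sketched in a few lines of Python. This is a toy illustration only; the shard names, doc IDs, and scores are invented, and real Solr does the merge internally on the aggregating node.]

```python
# Toy sketch of the two-pass distributed query flow: first pass returns
# only (sort_key, doc_id) per shard, the aggregator merges, then a second
# pass would fetch full docs per shard. All data here is made up.
import heapq

# First pass: each shard's local top hits, as (score, doc_id).
shard_results = {
    "shard1": [(0.95, "1"), (0.80, "5"), (0.42, "9")],
    "shard2": [(0.91, "33"), (0.60, "40")],
}

ROWS = 3  # rows=3 instead of 10, for brevity

# Aggregator (node1) merges all per-shard lists and keeps the true top-N.
merged = heapq.nlargest(
    ROWS,
    ((score, doc_id, shard)
     for shard, docs in shard_results.items()
     for score, doc_id in docs),
)

# Second pass: group winning IDs by shard, so node1 can ask each replica
# only for the docs it contributed, e.g. q=id:(1 5) to shard1.
ids_by_shard = {}
for score, doc_id, shard in merged:
    ids_by_shard.setdefault(shard, []).append(doc_id)

print(merged)        # true top-3 as (score, id, shard), descending
print(ids_by_shard)  # per-shard fetch lists for the second pass
```

The key point the sketch makes visible: the second pass is an extra round trip per shard, which is why shaving a few milliseconds off the first pass may not move the total query time much.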
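[Editor's note: a minimal sketch of the prototyping step Erick suggests, i.e. taking logged queries and appending &distrib=false so each one runs on a single core. The log lines below are invented examples, not real Solr logs.]

```python
# Sketch: force logged queries to run non-distributed for single-shard
# load testing by appending distrib=false. Example query strings invented.
logged = [
    "/select?q=title:solr&rows=10",
    "/select?q=author:blake&sort=score desc",
]

def single_shard(query: str) -> str:
    """Append distrib=false so the query runs only on the receiving core."""
    sep = "&" if "?" in query else "?"
    return f"{query}{sep}distrib=false"

test_queries = [single_shard(q) for q in logged]
for q in test_queries:
    print(q)
```

Feed the resulting queries to your load tester at 10M docs, then 20M, and so on, to find where latency on a single shard becomes unacceptable.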