Sure, prototyping is the best answer ;).

Well, the biggest question is "how long
does each shard spend on the query"?

Adding more shards will likely decrease
the time each shard spends on the
first-pass query. But if your baseline is
50ms, and sharding some more takes it to
40ms, I don't think it's worth it.

I don't think it's worth it because a small first-pass
savings is likely dwarfed by second-pass processing.
Here's the scenario; say you're returning 10 rows.

> node1 receives the original request.
> node1 forwards sub-requests to one replica of each shard.
> node1 receives the top 10 results from each replica
   but all that's returned is the doc ID and sort criteria.
> node1 sorts all those lists and determines the true top 10
> node1 then issues a query like "q=id:(1 5 33)" to each
   replica to get the docs in the top 10 that came from that
   replica on the first pass.
> node1 then assembles the true top 10 into a list to return
   to the client.
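The merge in the middle of that sequence can be sketched like this (a toy illustration in Python, not Solr's actual code; the shard names, scores, and doc IDs are all made up):

```python
import heapq

# Hypothetical first-pass responses: each replica returns only
# (sort_value, doc_id) pairs for its local top hits.
shard_results = {
    "shard1": [(0.97, 1), (0.91, 5), (0.80, 7)],
    "shard2": [(0.95, 33), (0.85, 12)],
}

# node1 merges the per-shard lists and keeps the true top N.
rows = 3
merged = heapq.nlargest(
    rows,
    (hit for hits in shard_results.values() for hit in hits),
)

# Second pass: group the winning IDs by shard, so node1 can ask
# each replica (e.g. q=id:(1 5 33)) for just the full documents
# it contributed to the true top N.
ids_by_shard = {}
for shard, hits in shard_results.items():
    winners = [doc_id for score, doc_id in merged
               if (score, doc_id) in hits]
    if winners:
        ids_by_shard[shard] = winners

print(merged)        # true top 3 as (score, doc_id)
print(ids_by_shard)  # which IDs to fetch from which shard
```

The point being: the coordinator does two round trips per query, and the second one exists no matter how fast the first pass was.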

And, as you add more shards, you run into the "laggard problem":
the more shards a query touches, the greater the likelihood that
you hit one node during a garbage collection (or anything else
that slows its processing), and the slowest shard's response
gates the entire query.
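To put a rough number on it: if any single shard has, say, a 1% chance of being slow at the moment you hit it (a made-up figure, purely for illustration), the chance that at least one shard in the request is a laggard grows quickly with shard count:

```python
# P(at least one laggard) = 1 - (1 - p) ** n_shards,
# assuming independent shards and a fixed per-shard
# probability p of being slow (mid-GC, etc.).
p = 0.01  # hypothetical per-shard slow probability
for n in (4, 8, 16):
    print(n, round(1 - (1 - p) ** n, 3))
```

With these assumed numbers, 16 shards give roughly a 15% chance of catching at least one slow node per query, versus about 4% at 4 shards.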

The fastest way I can think of to prototype would be to set up
your load tester with realistic queries (or, better yet, queries
pulled from your logs if you have them) and append &distrib=false
to them all (or just use a single non-sharded machine).
Then start with, say, 10M docs on a test machine and run
a load test. Add 10M more, test again, and so on until you have
good numbers.
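For example, rewriting logged queries for the single-shard test could be as simple as the following (the query strings here are placeholders; distrib=false is a standard Solr parameter that keeps the request on the core that receives it):

```python
# Append distrib=false to each logged query string so the
# load test measures one core, not the whole collection.
logged_queries = [
    "q=*:*&rows=10",
    "q=title:solr&fq=type:doc",  # made-up example queries
]

test_queries = [q + "&distrib=false" for q in logged_queries]

for q in test_queries:
    print(q)
```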

Frankly, though, my suspicion is you'll be fine with 25M docs/shard.

Best,
Erick

On Wed, Sep 23, 2015 at 7:08 AM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> It's quite common to hear about the benefit of sharding.
> Until we reach the I/O bound on our machines, sharding is likely to reduce
> the query time.
> Furthermore working on smaller indexes will make the single searches faster
> on the smaller nodes.
> But what about the other way around?
> What if we actually shard much more than needed?
> Are we also going to see an increase in query time (due to the
> overhead of query distribution and aggregation of results)?
>
> Example: 100,000,000 docs across 16 shards.
> Wouldn't it be more effective to have 4 or 8 shards at most?
> I suspect that prototyping is the right answer, but do we have any
> strongly motivated general suggestions?
> Any resources to study?
>
> Cheers
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>
