bq: I did hope that SolrCloud would have a standard load balancing mechanism for all client types rather than just those using a specific Java library.
It does. For queries. There is a software load balancer, as Garth mentioned; the "aggregator" node can be farmed out. But for queries you want to use either 1> a hardware load balancer or 2> CloudSolrClient (or CloudSolrServer in 4x) from a SolrJ client, because if you use a single HTTP endpoint, even though it'll load-balance the aggregator (and sub-request to individual replicas), it's a single point of failure. If that node happens to go down, you serve no queries.

For indexing you want to use SolrJ and CloudSolrClient if at all possible. While sending updates to a random node works, you create a bunch more network traffic, since whatever node receives a document forwards it to the correct leader, which is a requirement for data integrity. CloudSolrClient sends each document (or subset of documents if you use the add(list) form) to the right leader, thus eliminating the extra hop. That said, if you aren't indexing at a furious rate you probably won't notice.

About merging hits... it's pretty often a mistake to try to control this, but testing is always good.

Sure, for a "rule of thumb": don't shard ;). Sharding inevitably adds overhead. As long as you get adequate response times without sharding, don't. If your SLA for query response is, say, 500ms and you meet that with one shard, why bother? If you need a higher QPS rate, add _replicas_ in this situation. Only shard when your queries aren't getting served quickly enough and you've tuned your single shard as much as possible.

"When your queries aren't getting served quickly enough" is fuzzy, but for "reasonable documents on reasonable hardware" I generally expect 50M docs/shard. YMMV, based on documents and use cases, faceting, whatever. I've seen 300M docs fit in 12G. I've seen 10M docs strain pretty beefy machines.

So let's say you do shard. First of all, to have any hope of serving queries from only specific shards, you need to control the routing of docs at index time (which you can do). Let's say you do this successfully (my favorite example is 30 days' worth of news stories). Let's further say you allow searches by "the last day, the last week, etc.". In the absurd example of all searches being on breaking news in the last 24 hours, all your searches are being carried out on a single shard, which makes poor use of your hardware. Not saying doing this is always bad, just something to consider.

FWIW,
Erick

On Fri, Oct 21, 2016 at 8:54 AM, <hairymccla...@yahoo.com.invalid> wrote:
>>>> Yes, that's possible. It's what I was thinking about when I mentioned "...general case flow". That capability is relatively new, and not the default, which is why I didn't mention it.
>
> Yes, thought you probably meant that, was just adding it explicitly.
>
>>>> And load balancing for reliability purposes is always a good thing.
>
> Generally regarding load balancing queries - I did hope that SolrCloud would have a standard load balancing mechanism for all client types rather than just those using a specific Java library. There are two elements to this - firstly distributing a query across shards, and secondly choosing a replica for a given shard (which you mention will happen - presumably regardless of the client used?).
>
> Because of the merging stuff you mention, I personally intend to test a few different sharding strategies.
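(For reference, the CloudSolrClient indexing path Erick describes above looks roughly like this in SolrJ. This is a minimal sketch, assuming a Solr 5.x/6.x-era SolrJ; the ZooKeeper ensemble string, collection name, and field names are made up for illustration.)

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudIndexSketch {
      public static void main(String[] args) throws Exception {
        // Reads cluster state from ZooKeeper, then routes docs to the right shard leaders.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("mycollection");   // made-up collection name

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-" + i);
          doc.addField("title_t", "news story " + i);  // made-up field
          batch.add(doc);
        }

        // The add(list) form: the client splits the batch by shard and sends each
        // subset directly to that shard's leader, skipping the extra forwarding hop.
        client.add(batch);
        client.commit();   // in real indexing you'd usually rely on autoCommit/commitWithin
        client.close();
      }
    }

(Queries issued through the same client are spread across live nodes and fail over automatically, which is the single-point-of-failure argument above.)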
> I'd be interested to know if anyone has a rule of thumb about when it makes sense to shard and live with the merge hit, and when it makes sense to shard based on the most common queries (so they end up getting served by a single shard or a small number of shards, or their replicas).
>
>
> On Friday, October 21, 2016 2:39 PM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:
>
>
> I just realized that I made an assumption about your initial question that may not be true.
>
> Everything I've said has been based on handling requests to add/update documents during the indexing process. That process involves the "leader first" concept I've been mentioning.
>
> So to answer your original question on the query side....
>
>> Actually, zookeeper really won't participate in the query process at all. And the leader role for a core in a shard has no bearing whatsoever.
>>
>> ;-) Read ymonad's answer. ;-) The CloudSolrServer class has been renamed to CloudSolrClient (or something similar) recently, but otherwise, I think his answer is still basically correct.
>
> It's worth noting that even if the node that receives the request has a core that could participate in generating results, it might ask some other core of that same shard to return the results for that shard. The preferLocalShards parameter can be used to avoid that (near the bottom of https://cwiki.apache.org/confluence/display/solr/Distributed+Requests).
>
> In any case, if you have many shards, load balancing on the query side is definitely more important than on the indexing side. The query controller will have to merge the result sets (one from each shard), initiate the second pass of requests to get stored fields, and then marshal all that data back through the HTTP response. That's more extra work than the controller has to do for an update request, where it basically just passes along whatever information the shard leader responded with.
>
> And load balancing for reliability purposes is always a good thing.
>
>>>> Also, for indexing, I think it's possible to control how many replicas need to confirm to the leader before the response is supplied to the client, as you can with say MongoDB replicas.
>
> Yes, that's possible. It's what I was thinking about when I mentioned "...general case flow". That capability is relatively new, and not the default, which is why I didn't mention it.
>
> -----Original Message-----
> From: hairymccla...@yahoo.com.INVALID [mailto:hairymccla...@yahoo.com.INVALID]
> Sent: Friday, October 21, 2016 4:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Load balancing with solr cloud
>
> As I understand it, for non-SolrCloud-aware clients you have to manually load balance your searches; see ymonad's answer here:
> http://stackoverflow.com/questions/22523588/loadbalancer-and-solrcloud
>
> This is from 2014 so maybe this has changed now - would be interested to know as well.
> Also, for indexing, I think it's possible to control how many replicas need to confirm to the leader before the response is supplied to the client, as you can with say MongoDB replicas.
>
>
> On Friday, October 21, 2016 1:18 AM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:
>
>
> No matter where you send the update to initially, it will get sent to the leader of the shard first. The leader parses it to ensure it can be indexed, then sends it to all the replicas in parallel.
> The replicas will do their parsing and report back that they have persisted the data to their tlogs. Once the leader hears back from all the replicas, the leader will reply back that the update is complete, and your client will receive its HTTP response on the transaction.
>
> At least that's the general case flow.
>
> So it really won't matter how your load balancing is handled above the cloud. All the work is done the same way, with the leader having to do slightly more work than the replicas.
>
> If you can manage to initially send all the updates to the correct leader, you can skip one hop before the work starts, which may buy you a small performance boost compared to randomly picking a node to send the request to. But you'll need to be taxing the cloud pretty heavily before that difference becomes noticeable.
>
> -----Original Message-----
> From: Sadheera Vithanage [mailto:sadhee...@gmail.com]
> Sent: Thursday, October 20, 2016 5:55 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Load balancing with solr cloud
>
> Thank you very much John and Garth,
>
> I've tested it out and it works fine; I can send the updates to any of the solr nodes.
>
> If I am not using a zookeeper-aware client and if I direct all my queries (read queries) always to the leader of the solr instances, does it automatically load balance between the replicas?
>
> Or do I have to hit each instance in a round-robin way and have the load balanced through the code?
>
> Please advise the best way to do so.
>
> Thank you very much again.
>
>
> On Fri, Oct 21, 2016 at 9:18 AM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:
>
>> Actually, zookeeper really won't participate in the update process at all.
>>
>> If you're using a "zookeeper aware" client like SolrJ, the SolrJ library will read the cloud configuration from zookeeper, but will send all the updates to the leader of the shard that the document is meant to go to.
>>
>> If you're not using a "zookeeper aware" client, you can send the update to any of the solr nodes, and they will evaluate the cloud configuration information they've already received from zookeeper, and then forward the document to the leader of the shard that will handle the document update.
>>
>> In general, Zookeeper really only provides the cloud configuration information once (at most) during all the updates; the actual document updates only get sent to solr nodes. There's definitely no need to distribute load between zookeepers for this situation.
>>
>> Regards,
>> Garth Grimm
>>
>> -----Original Message-----
>> From: Sadheera Vithanage [mailto:sadhee...@gmail.com]
>> Sent: Thursday, October 20, 2016 5:11 PM
>> To: solr-user@lucene.apache.org
>> Subject: Load balancing with solr cloud
>>
>> Hi again Experts,
>>
>> I have a question related to load balancing in solr cloud.
>>
>> If we have 3 zookeeper nodes and 3 solr instances (1 leader, 2 secondary replicas and 1 shard), when the traffic comes in, the primary zookeeper server will be hammered, correct?
>>
>> I understand (or is it wrong) that zookeeper will load balance between solr nodes, but if we want to distribute the load between zookeeper nodes as well, what is the best approach?
>>
>> Cost is a concern for us too.
>>
>> Thank you very much, in advance.
>>
>> --
>> Regards
>>
>> Sadheera Vithanage
>
> --
> Regards
>
> Sadheera Vithanage
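(Going back to two points in the quoted thread above: the preferLocalShards parameter Garth mentions, and the "how many replicas need to confirm" capability, which I believe is the min_rf request parameter. A rough SolrJ sketch of both follows, again assuming a Solr 5.x/6.x-era client with made-up collection, query, and field names, and with the caveat that min_rf does not block the update; it only makes Solr report the achieved replication factor so the client can check it and retry.)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class PreferLocalAndMinRfSketch {
      public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("mycollection");   // made-up collection name

        // Query side: preferLocalShards asks each shard to be served by a core on the
        // node that received the request, when one exists (see the Distributed Requests
        // page linked in the thread).
        SolrQuery q = new SolrQuery("title_t:breaking");   // made-up field and query
        q.set("preferLocalShards", true);
        QueryResponse qr = client.query(q);
        System.out.println("hits: " + qr.getResults().getNumFound());

        // Indexing side: min_rf asks Solr to report the achieved replication factor.
        // It does not fail the update; the client checks the reported value and retries if needed.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        UpdateRequest req = new UpdateRequest();
        req.setParam("min_rf", "2");
        req.add(doc);
        UpdateResponse ur = req.process(client);
        // The achieved factor should come back as "rf"; where exactly it appears in the
        // response may vary by version, so treat this as an assumption to verify.
        Object achievedRf = ur.getResponseHeader().get("rf");
        System.out.println("achieved rf: " + achievedRf);

        client.close();
      }
    }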