Re: Querying only replica's

Robert Brown Mon, 11 Jan 2016 09:25:37 -0800

We won't be using SolrJ, etc. anytime soon unfortunately.

We'll be using a hardware load-balancer to send requests into the cloud/pool of servers.

The LB therefore needs to know when a node is down, otherwise a query wouldn't get anywhere.


The solr.PingRequestHandler is what I was after.
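
For reference, a minimal sketch of the decision an external health check could make from the ping handler's response. It assumes the handler is reachable at /admin/ping under the collection and that wt=json returns a body containing "status":"OK"; the function names, host and collection are hypothetical placeholders, and no request is actually sent here.

```python
import json
from urllib.parse import quote


def ping_url(base_url: str, collection: str) -> str:
    """Build the ping URL for one node (base_url and collection are placeholders)."""
    return f"{base_url.rstrip('/')}/{quote(collection)}/admin/ping?wt=json"


def is_healthy(http_status: int, body: str) -> bool:
    """Treat a node as healthy only on HTTP 200 with status "OK" in the JSON body.

    Assumption: the ping handler answers with a JSON object that has a
    top-level "status" field when wt=json is requested.
    """
    if http_status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all (error page, truncated response, ...).
        return False
    return isinstance(payload, dict) and payload.get("status") == "OK"
```

A load-balancer that only supports plain HTTP checks could apply the same rule by requesting that URL and matching on the 200 status code and the "OK" string.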




On 01/11/2016 05:16 PM, Alessandro Benedetti wrote:

Mmm, I think there is a misconception here:

On 10 January 2016 at 19:00, Robert Brown <r...@intelcompute.com> wrote:

I'm thinking more about how the external load-balancer will know if a node
is down, so it can take it out of the pool of active servers before even
attempting to send it a query.

This is SolrCloud's responsibility; in particular, ZooKeeper knows the
topology of the cluster.
A query will not reach a dead node.
You should use a SolrCloud-aware client (like the SolrJ one).

If you want to use a different load-balancer because you don't like the
SolrCloud one, it will not be that easy, because the distribution of the
queries happens automatically.

Cheers

I could ping, though that only shows the IP is alive.  I could configure the
load-balancer to actually try a query, but that may be a performance hit
(even if a tiny one).

Is there another recommended way of configuring external load-balancers to
know when a node is not accepting queries?




On 10/01/16 18:25, Erick Erickson wrote:

For health checks, you can go ahead and get the real IP addresses and
ping them directly if you care to.... Or just let Zookeeper do that
for you. One of the tasks of Zookeeper is pinging all the machines
with all the replicas and, if any of them are unreachable, telling the
rest of the cluster that that machine is down.

Best,
Erick

On Sun, Jan 10, 2016 at 5:19 AM, Robert Brown <r...@intelcompute.com>
wrote:

Thanks Erick,

For the health-checks on the load-balancer side, would you recommend a
simple query, or is there a reliable ping or similar for this scenario?

Cheers,
Rob


On 09/01/16 23:44, Erick Erickson wrote:

bq: is it best/good to get the CLUSTERSTATUS via the collection API
and explicitly send queries to a replica to ensure I don't send
queries to the leaders of my collection

In a word, _no_. SolrCloud is vastly different from the old
master/slave. In SolrCloud, each and every node (leader and replicas)
indexes all the docs and serves queries. The additional burden the leader
has is actually very small. There's absolutely no reason to _not_ use
the leader to serve queries.

As far as sending updates, there would be a _little_ benefit to
sending the updates directly to the leader, but _far_ more benefit in
using SolrJ. If you use SolrJ (and CloudSolrClient), then the
documents are split up on the _client_ and only the docs for a
particular shard are automatically sent to the leader for that shard.
Using SolrJ you can essentially scale indexing linearly with the
number of shards you have. Just using HTTP does not scale linearly.
Your particular app may not care, but in high-throughput situations
this can be significant.
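
To illustrate the idea of splitting documents on the _client_, here is a minimal sketch that groups a batch of docs by target shard so each group could be sent straight to that shard's leader. The hash used here (CRC32 modulo the shard count) is a stand-in chosen for illustration and is not Solr's real routing function; shard_for and group_by_shard are hypothetical names.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard index for a document id.

    Assumption: CRC32 modulo num_shards is only a placeholder hash,
    NOT the routing SolrCloud itself applies.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def group_by_shard(docs, num_shards):
    """Split one batch into per-shard batches, keyed by shard index."""
    batches = defaultdict(list)
    for doc in docs:
        batches[shard_for(doc["id"], num_shards)].append(doc)
    return dict(batches)
```

Because each per-shard batch can go to its own leader independently, adding shards adds parallel indexing capacity, which is the scaling effect described above.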

So rather than spend time and effort sending updates directly to a
leader and have the leader then forward the docs to the correct shard,
I recommend investing the time in using SolrJ for updates rather than
sending updates to the leader over HTTP. Or just ignore the problem
and devote your efforts to something more valuable.

So in short:
1> just stick a load balancer in front of _all_ your Solr nodes for
queries. And note that there's an internal load balancer already in
Solr that routes things around anyway, although putting a load
balancer in front of your entire cluster makes it so there's not a
single point of failure.
2> Depending on your throughput needs, either
2a> use SolrJ to index
2b> don't worry about it and send updates through the load balancer as
well. There'll be an extra hop if you send updates to a replica, but
if that's significant you should be using SolrJ

As for 5.5, it's not at all clear that there _will_ be a 5.5. 5.4 was
just released in early December. There's usually a several month lag
between point releases and there's some agitation to start the 6.0
release process, so it's up in the air.


On Sat, Jan 9, 2016 at 12:04 PM, Robert Brown <r...@intelcompute.com>
wrote:

Hi,

(btw, when is 5.5 due?  I see the docs reference it, but not the download
page)

Anyway, I index and query Solr over HTTP (no SolrJ, etc.) - is it best/good
to get the CLUSTERSTATUS via the collection API and explicitly send queries
to a replica to ensure I don't send queries to the leaders of my collection,
to improve performance?  Like-wise with sending updates directly to a
Leader?

My leaders will receive full updates of the entire collection once a day,
so I would assume if the leader is handling queries too, performance would
be hit?

Is the CLUSTERSTATUS API the only way to do this btw without SolrJ, etc.?
I wasn't sure if ZooKeeper would be able to tell me also.
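
A minimal sketch of reading leaders and non-leader replicas out of a CLUSTERSTATUS response, for anyone who wants to inspect the topology over plain HTTP. It assumes the JSON from /admin/collections?action=CLUSTERSTATUS&wt=json is already parsed into a dict shaped as cluster -> collections -> shards -> replicas, with "leader" reported as the string "true"; split_leaders_and_replicas is a hypothetical helper. The replies in this thread advise against using this to steer queries away from leaders.

```python
def split_leaders_and_replicas(cluster_status: dict, collection: str):
    """Return (leader_urls, other_replica_urls) for active cores of one collection.

    Assumption: each replica entry carries "base_url", "core", "state"
    and, for the leader only, "leader": "true".
    """
    leaders, others = [], []
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    for shard in shards.values():
        for replica in shard["replicas"].values():
            if replica.get("state") != "active":
                continue  # skip cores that are down or recovering
            url = f'{replica["base_url"]}/{replica["core"]}'
            if str(replica.get("leader", "false")).lower() == "true":
                leaders.append(url)
            else:
                others.append(url)
    return leaders, others
```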

Do I also need to do anything to ensure the leaders are never sent queries
from the replicas?

Does this all sound sane?

One of my collections is 3 shards, with 2 replicas each (9 total nodes),
70m docs in total.

Thanks,
Rob
