On 8/1/2017 12:09 PM, Michael B. Klein wrote:
> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
> seems to be working OK, except that one of the nodes never seems to get
> its replica updated.
>
> Queries take place through a non-caching, round-robin load balancer. The
> collection looks fine, with one shard and a replicationFactor of 3.
> Everything in the cloud diagram is green.
>
> But if I (for example) select?q=id:hd76s004z, the results come up empty
> 1 out of every 3 times.
>
> Even several minutes after a commit and optimize, one replica still
> isn’t returning the right info.
>
> Do I need to configure my `solrconfig.xml` with `replicateAfter` options
> on the `/replication` requestHandler, or is that a non-solrcloud,
> standalone-replication thing?
This is one of the more confusing aspects of SolrCloud. When everything is working perfectly in a SolrCloud install, the feature in Solr called "replication" is *never* used. SolrCloud does require the replication feature, though ... which is what makes this whole thing so confusing.

Replication is used to copy an entire Lucene index (consisting of a bunch of files on the disk) from a core on a master server to a core on a slave server. This is how replication was done before SolrCloud was created.

The way that SolrCloud keeps replicas in sync is *entirely* different. SolrCloud has no masters and no slaves. When you index or delete a document in a SolrCloud collection, the request is forwarded to the leader of the correct shard for that document. The leader then sends a copy of that request to all the other replicas, and each replica (including the leader) independently handles the updates in the request. Since all replicas index the same content, they stay in sync.

What SolrCloud does with the replication feature is index recovery. In some situations recovery can be done from the leader's transaction log, but when a replica has gotten so far out of sync that the only option available is to completely replace the index on the bad replica, SolrCloud will fire up the replication feature and create an exact copy of the index from the replica that is currently elected as leader. SolrCloud temporarily designates the leader core as master and the bad replica as slave, then initiates a one-time replication. This is all completely automated and requires no configuration or input from the administrator.

The configuration elements you asked about are for the old master-slave replication setup and do not apply to SolrCloud at all.

What I would recommend to solve your immediate issue: shut down the Solr instance that is having the problem, rename the "data" directory in the core that isn't working right to something else, and start Solr back up. As long as you still have at least one good replica in the cloud, SolrCloud will see that the index data is gone and copy the index from the leader. You could delete the data directory instead of renaming it, but that would leave you with no "undo" option.
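If it helps, this is roughly what that looks like from the command line on a typical installation. The port, solr home path, and core directory name below are examples only; adjust them to match your own layout:

    # Stop the affected node (the port is an example).
    bin/solr stop -p 8983

    # Move the index data aside rather than deleting it, so you keep
    # an "undo" option.  The core directory name is a placeholder;
    # yours will be under the solr home, named something like
    # <collection>_shard1_replicaN.
    cd /var/solr/data/mycollection_shard1_replica1
    mv data data.bak

    # Start the node again.  It should notice that the index is gone
    # and replicate a fresh copy from the current shard leader.
    bin/solr start -p 8983

Once the replica shows green in the admin UI cloud diagram again, the copy is done and you can remove data.bak.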
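To verify that all three replicas really have the document afterward, you can query each core directly with distrib=false, which bypasses the normal distributed routing and answers only from the core you addressed. The hostnames and core names here are placeholders:

    curl 'http://node1:8983/solr/mycollection_shard1_replica1/select?q=id:hd76s004z&distrib=false'
    curl 'http://node2:8983/solr/mycollection_shard1_replica2/select?q=id:hd76s004z&distrib=false'
    curl 'http://node3:8983/solr/mycollection_shard1_replica3/select?q=id:hd76s004z&distrib=false'

If one of them comes back empty while the other two match, you have found your bad replica.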
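For reference, `replicateAfter` belongs to the old standalone setup, where the master's solrconfig.xml defines the /replication handler explicitly. A typical sketch looks something like this (do not add it to a SolrCloud install; SolrCloud defines the handler implicitly and manages it itself):

    <!-- standalone master-slave replication only; not for SolrCloud -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">startup</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

Thanks,
Shawn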