On 8/1/2017 12:09 PM, Michael B. Klein wrote:
> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
> seems to be working OK, except that one of the nodes never seems to get
> its replica updated.
>
> Queries take place through a non-caching, round-robin load balancer. The
> collection looks fine, with one shard and a replicationFactor of 3.
> Everything in the cloud diagram is green.
>
> But if I (for example) select?q=id:hd76s004z, the results come up empty
> 1 out of every 3 times.
>
> Even several minutes after a commit and optimize, one replica still
> isn’t returning the right info.
>
> Do I need to configure my `solrconfig.xml` with `replicateAfter` options
> on the `/replication` requestHandler, or is that a non-solrcloud,
> standalone-replication thing?
This is one of the more confusing aspects of SolrCloud. When everything is working perfectly in a SolrCloud install, the feature in Solr called "replication" is *never* used. SolrCloud does require the replication feature, though ... which is what makes this whole thing so confusing.

Replication is used to copy an entire Lucene index (consisting of a bunch of files on the disk) from a core on a master server to a core on a slave server. This is how replication was done before SolrCloud was created.

The way that SolrCloud keeps replicas in sync is *entirely* different. SolrCloud has no masters and no slaves. When you index or delete a document in a SolrCloud collection, the request is forwarded to the leader of the correct shard for that document. The leader then sends a copy of that request to all the other replicas, and each replica (including the leader) independently handles the updates in the request. Since all replicas index the same content, they stay in sync.

What SolrCloud does with the replication feature is index recovery. In some situations recovery can be done from the leader's transaction log, but when a replica has gotten so far out of sync that the only option available is to completely replace the index on the bad replica, SolrCloud will fire up the replication feature and create an exact copy of the index from the replica that is currently elected as leader. SolrCloud temporarily designates the leader core as master and the bad replica as slave, then initiates a one-time replication. This is all completely automated and requires no configuration or input from the administrator.

The configuration elements you asked about are for the old master-slave replication setup and do not apply to SolrCloud at all.

What I would recommend to solve your immediate issue: shut down the Solr instance that is having the problem, rename the "data" directory in the core that isn't working right to something else, and start Solr back up. As long as you still have at least one good replica in the cloud, SolrCloud will see that the index data is gone and copy the index from the leader. You could delete the data directory instead of renaming it, but that would leave you with no "undo" option.
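If it helps, this is roughly what that looks like from the command line on a typical installation. The port, solr home path, and core directory name below are examples only; adjust them to match your own layout:

    # Stop the affected node (the port is an example).
    bin/solr stop -p 8983

    # Move the index data aside rather than deleting it, so you keep
    # an "undo" option.  The core directory name is a placeholder;
    # yours will be under the solr home, named something like
    # <collection>_shard1_replicaN.
    cd /var/solr/data/mycollection_shard1_replica1
    mv data data.bak

    # Start the node again.  It should notice that the index is gone
    # and replicate a fresh copy from the current shard leader.
    bin/solr start -p 8983

Once the replica shows green in the admin UI cloud diagram again, the copy is done and you can remove data.bak.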
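To verify that all three replicas really have the document afterward, you can query each core directly with distrib=false, which bypasses the normal distributed routing and answers only from the core you addressed. The hostnames and core names here are placeholders:

    curl 'http://node1:8983/solr/mycollection_shard1_replica1/select?q=id:hd76s004z&distrib=false'
    curl 'http://node2:8983/solr/mycollection_shard1_replica2/select?q=id:hd76s004z&distrib=false'
    curl 'http://node3:8983/solr/mycollection_shard1_replica3/select?q=id:hd76s004z&distrib=false'

If one of them comes back empty while the other two match, you have found your bad replica.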
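For reference, `replicateAfter` belongs to the old standalone setup, where the master's solrconfig.xml defines the /replication handler explicitly. A typical sketch looks something like this (do not add it to a SolrCloud install; SolrCloud defines the handler implicitly and manages it itself):

    <!-- standalone master-slave replication only; not for SolrCloud -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">startup</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

Thanks,
Shawn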