And the one that isn't getting the updates is the one marked in the cloud diagram as the leader.
/me bangs head on desk

On Wed, Aug 2, 2017 at 10:31 AM, Michael B. Klein <mbkl...@gmail.com> wrote:
> Another observation: After bringing the cluster back up just now, the
> "1-in-3 nodes don't get the updates" issue persists, even with the cloud
> diagram showing 3 nodes, all green.
>
> On Wed, Aug 2, 2017 at 9:56 AM, Michael B. Klein <mbkl...@gmail.com> wrote:
>
>> Thanks for your responses, Shawn and Erick.
>>
>> Some clarification questions, but first a description of my
>> (non-standard) use case:
>>
>> My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are
>> working well so far on the production cluster (knock wood); it's the
>> staging cluster that's giving me fits. Here's why: In order to save
>> money, I have the AWS auto-scaler scale the cluster down to zero nodes
>> when it's not in use. Here's the (automated) procedure:
>>
>> SCALE DOWN
>> 1) Call admin/collections?action=BACKUP for each collection to a shared
>>    NFS volume
>> 2) Shut down all the nodes
>>
>> SCALE UP
>> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
>> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
>>    live_nodes
>> 3) Call admin/collections?action=RESTORE to put all the collections back
>>
>> This has been working very well, for the most part, with the following
>> complications/observations:
>>
>> 1) If I don't optimize each collection right before BACKUP, the backup
>>    fails (see the attached solr_backup_error.json).
>> 2) If I don't specify a replicationFactor during RESTORE, the admin
>>    interface's Cloud diagram only shows one active node per collection.
>>    Is this expected? Am I required to specify the replicationFactor
>>    unless I'm using a shared HDFS volume for solr data?
>> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
>>    message in the response, even though the restore seems to succeed.
>> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I
>>    do not currently have any replication stuff configured (as it seems I
>>    should not).
>> 5) At the time my "1-in-3 requests are failing" issue occurred, the
>>    Cloud diagram looked like the attached solr_admin_cloud_diagram.png.
>>    It seemed to think all replicas were live and synced and happy, and
>>    because I was accessing Solr through a round-robin load balancer, I
>>    was never able to tell which node was out of sync.
>>
>> If it happens again, I'll make node-by-node requests and try to figure
>> out what's different about the failing one. But the fact that this
>> happened (and the way it happened) is making me wonder if/how I can
>> automate this staging environment scaling reliably and with confidence
>> that it will Just Work™.
>>
>> Comments and suggestions would be GREATLY appreciated.
>>
>> Michael
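For reference, the BACKUP/RESTORE cycle described above amounts to a pair of
Collections API calls per collection, along the lines of the sketch below; the
host, collection name, backup name and NFS mount point here are placeholders,
not the actual values from this cluster:

    # back up one collection to the shared NFS volume before scaling down
    curl "http://localhost:8983/solr/admin/collections?action=BACKUP&name=mycoll-backup&collection=mycoll&location=/mnt/solr-backups"

    # restore it after the new nodes show up under live_nodes
    curl "http://localhost:8983/solr/admin/collections?action=RESTORE&name=mycoll-backup&collection=mycoll&location=/mnt/solr-backups&replicationFactor=3&maxShardsPerNode=1"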
>> On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> And please do not use optimize unless your index is
>>> totally static. I only recommend it when the pattern is
>>> to update the index periodically, like every day or
>>> something, and not update any docs in between times.
>>>
>>> Implied in Shawn's e-mail was that you should undo
>>> anything you've done in terms of configuring replication;
>>> just go with the defaults.
>>>
>>> Finally, my bet is that your problematic Solr node is misconfigured.
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>>> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>>> >> I have a 3-node SolrCloud cluster orchestrated by Zookeeper. Most
>>> >> stuff seems to be working OK, except that one of the nodes never
>>> >> seems to get its replica updated.
>>> >>
>>> >> Queries take place through a non-caching, round-robin load balancer.
>>> >> The collection looks fine, with one shard and a replicationFactor
>>> >> of 3. Everything in the cloud diagram is green.
>>> >>
>>> >> But if I (for example) select?q=id:hd76s004z, the results come up
>>> >> empty 1 out of every 3 times.
>>> >>
>>> >> Even several minutes after a commit and optimize, one replica still
>>> >> isn't returning the right info.
>>> >>
>>> >> Do I need to configure my `solrconfig.xml` with `replicateAfter`
>>> >> options on the `/replication` requestHandler, or is that a
>>> >> non-SolrCloud, standalone-replication thing?
>>> >
>>> > This is one of the more confusing aspects of SolrCloud.
>>> >
>>> > When everything is working perfectly in a SolrCloud install, the
>>> > feature in Solr called "replication" is *never* used. SolrCloud does
>>> > require the replication feature, though ... which is what makes this
>>> > whole thing very confusing.
>>> >
>>> > Replication is used to replicate an entire Lucene index (consisting
>>> > of a bunch of files on the disk) from a core on a master server to a
>>> > core on a slave server. This is how replication was done before
>>> > SolrCloud was created.
>>> >
>>> > The way that SolrCloud keeps replicas in sync is *entirely*
>>> > different. SolrCloud has no masters and no slaves. When you index or
>>> > delete a document in a SolrCloud collection, the request is forwarded
>>> > to the leader of the correct shard for that document. The leader then
>>> > sends a copy of that request to all the other replicas, and each
>>> > replica (including the leader) independently handles the updates that
>>> > are in the request. Since all replicas index the same content, they
>>> > stay in sync.
>>> >
>>> > What SolrCloud does with the replication feature is index recovery.
>>> > In some situations recovery can be done from the leader's transaction
>>> > log, but when a replica has gotten so far out of sync that the only
>>> > option available is to completely replace the index on the bad
>>> > replica, SolrCloud will fire up the replication feature and create an
>>> > exact copy of the index from the replica that is currently elected as
>>> > leader. SolrCloud temporarily designates the leader core as master
>>> > and the bad replica as slave, then initiates a one-time replication.
>>> > This is all completely automated and requires no configuration or
>>> > input from the administrator.
>>> >
>>> > The configuration elements you have asked about are for the old
>>> > master-slave replication setup and do not apply to SolrCloud at all.
>>> >
>>> > What I would recommend that you do to solve your immediate issue:
>>> > Shut down the Solr instance that is having the problem, rename the
>>> > "data" directory in the core that isn't working right to something
>>> > else, and start Solr back up. As long as you still have at least one
>>> > good replica in the cloud, SolrCloud will see that the index data is
>>> > gone and copy the index from the leader. You could delete the data
>>> > directory instead of renaming it, but that would leave you with no
>>> > "undo" option.
>>> >
>>> > Thanks,
>>> > Shawn
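One way to identify which of the three replicas is stale, rather than guessing
through the round-robin load balancer, is to query each node directly with
distrib=false so the request stays on that node's local core. A rough sketch,
assuming the nodes are reachable as solr-node-1 through solr-node-3 and the
collection is named mycoll (both placeholders):

    for host in solr-node-1 solr-node-2 solr-node-3; do
      echo -n "$host: "
      # distrib=false keeps the query on the local replica instead of fanning out
      curl -s "http://$host:8983/solr/mycoll/select?q=id:hd76s004z&distrib=false&rows=0&wt=json" \
        | grep -o '"numFound":[0-9]*'
    done

The node that answers with "numFound":0 while the others report 1 is the
replica that the rename-the-data-directory recovery step would apply to.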