Thanks for your responses, Shawn and Erick. Some clarification questions, but first a description of my (non-standard) use case:
My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are working well so far on the production cluster (knock wood); it's the staging cluster that's giving me fits. Here's why: in order to save money, I have the AWS auto-scaler scale the cluster down to zero nodes when it's not in use. Here's the (automated) procedure (a rough sketch of the API calls appears after the lists below):

SCALE DOWN
1) Call admin/collections?action=BACKUP for each collection to a shared NFS volume
2) Shut down all the nodes

SCALE UP
1) Spin up 2 Zookeeper nodes and wait for them to stabilize
2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's live_nodes
3) Call admin/collections?action=RESTORE to put all the collections back

This has been working very well, for the most part, with the following complications/observations:

1) If I don't optimize each collection right before BACKUP, the backup fails (see the attached solr_backup_error.json).

2) If I don't specify a replicationFactor during RESTORE, the admin interface's Cloud diagram only shows one active node per collection. Is this expected? Am I required to specify the replicationFactor unless I'm using a shared HDFS volume for Solr data?

3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning message in the response, even though the restore seems to succeed.

4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do not currently have any replication settings configured (as it seems I should not).

5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to think all replicas were live and synced and happy, and because I was accessing Solr through a round-robin load balancer, I was never able to tell which node was out of sync. If it happens again, I'll make node-by-node requests and try to figure out what's different about the failing one.

But the fact that this happened (and the way it happened) is making me wonder if/how I can make this automated staging-environment scaling reliable, with confidence that it will Just Work™.
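For concreteness, here's roughly what the automation does. This is a simplified sketch, not the actual script: the base URL, collection names, and backup location below are placeholders, and the real thing drives the AWS auto-scaler between steps.

```python
"""Simplified sketch of the scale-down/scale-up calls (placeholders throughout)."""
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"            # any live Solr node (placeholder)
LOCATION = "/mnt/solr_backups"                 # shared NFS volume (placeholder)
COLLECTIONS = ["collection1", "collection2"]   # placeholder collection names

def collections_api(**params):
    """Call the Collections API and return the parsed JSON response."""
    url = "%s/admin/collections?%s" % (SOLR, urlencode({"wt": "json", **params}))
    with urlopen(url) as resp:
        return json.load(resp)

def scale_down():
    for coll in COLLECTIONS:
        # Optimize first -- without this, BACKUP fails (observation 1 above)
        urlopen("%s/%s/update?optimize=true&waitSearcher=true" % (SOLR, coll)).read()
        collections_api(action="BACKUP", name=coll, collection=coll, location=LOCATION)
    # ...then the AWS auto-scaler terminates the Solr and Zookeeper nodes

def scale_up():
    # ...after the auto-scaler has brought the Zookeeper and Solr nodes back up...
    while len(collections_api(action="CLUSTERSTATUS")["cluster"]["live_nodes"]) < 3:
        time.sleep(5)                          # wait for all 3 Solr nodes to register
    for coll in COLLECTIONS:
        # replicationFactor and maxShardsPerNode per observations 2 and 3 above
        collections_api(action="RESTORE", name=coll, collection=coll, location=LOCATION,
                        replicationFactor=3, maxShardsPerNode=1)
```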
Comments and suggestions would be GREATLY appreciated.

Michael

On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> And please do not use optimize unless your index is
> totally static. I only recommend it when the pattern is
> to update the index periodically, like every day or
> something and not update any docs in between times.
>
> Implied in Shawn's e-mail was that you should undo
> anything you've done in terms of configuring replication,
> just go with the defaults.
>
> Finally, my bet is that your problematic Solr node is misconfigured.
>
> Best,
> Erick
>
> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
> >> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
> >> seems to be working OK, except that one of the nodes never seems to get its
> >> replica updated.
> >>
> >> Queries take place through a non-caching, round-robin load balancer. The
> >> collection looks fine, with one shard and a replicationFactor of 3.
> >> Everything in the cloud diagram is green.
> >>
> >> But if I (for example) select?q=id:hd76s004z, the results come up empty 1
> >> out of every 3 times.
> >>
> >> Even several minutes after a commit and optimize, one replica still isn’t
> >> returning the right info.
> >>
> >> Do I need to configure my `solrconfig.xml` with `replicateAfter` options on
> >> the `/replication` requestHandler, or is that a non-solrcloud,
> >> standalone-replication thing?
> >
> > This is one of the more confusing aspects of SolrCloud.
> >
> > When everything is working perfectly in a SolrCloud install, the feature
> > in Solr called "replication" is *never* used. SolrCloud does require
> > the replication feature, though ... which is what makes this whole thing
> > very confusing.
> >
> > Replication is used to replicate an entire Lucene index (consisting of a
> > bunch of files on the disk) from a core on a master server to a core on
> > a slave server. This is how replication was done before SolrCloud was
> > created.
> >
> > The way that SolrCloud keeps replicas in sync is *entirely* different.
> > SolrCloud has no masters and no slaves. When you index or delete a
> > document in a SolrCloud collection, the request is forwarded to the
> > leader of the correct shard for that document. The leader then sends a
> > copy of that request to all the other replicas, and each replica
> > (including the leader) independently handles the updates that are in the
> > request. Since all replicas index the same content, they stay in sync.
> >
> > What SolrCloud does with the replication feature is index recovery. In
> > some situations recovery can be done from the leader's transaction log,
> > but when a replica has gotten so far out of sync that the only option
> > available is to completely replace the index on the bad replica,
> > SolrCloud will fire up the replication feature and create an exact copy
> > of the index from the replica that is currently elected as leader.
> > SolrCloud temporarily designates the leader core as master and the bad
> > replica as slave, then initiates a one-time replication. This is all
> > completely automated and requires no configuration or input from the
> > administrator.
> >
> > The configuration elements you have asked about are for the old
> > master-slave replication setup and do not apply to SolrCloud at all.
> >
> > What I would recommend that you do to solve your immediate issue: Shut
> > down the Solr instance that is having the problem, rename the "data"
> > directory in the core that isn't working right to something else, and
> > start Solr back up. As long as you still have at least one good replica
> > in the cloud, SolrCloud will see that the index data is gone and copy
> > the index from the leader. You could delete the data directory instead
> > of renaming it, but that would leave you with no "undo" option.
> >
> > Thanks,
> > Shawn
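P.S. In case it helps anyone else following along, here is roughly how I'd script Shawn's "rename the data directory" suggestion if the problem comes back. The service name and core path below are guesses for a stock Solr install and would need adjusting for the actual node.

```python
"""Sketch of the data-directory reset Shawn describes (paths are placeholders)."""
import subprocess
import time
from pathlib import Path

# Core directory of the out-of-sync replica (placeholder path)
CORE_DIR = Path("/var/solr/data/mycollection_shard1_replica3")

# Stop the problem node (assumes the standard Solr service install)
subprocess.run(["service", "solr", "stop"], check=True)

# Rename rather than delete, so there's still an "undo" option
data_dir = CORE_DIR / "data"
data_dir.rename(CORE_DIR / ("data.bad." + time.strftime("%Y%m%d%H%M%S")))

subprocess.run(["service", "solr", "start"], check=True)
# On startup, SolrCloud notices the missing index and copies it from the leader
```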