On 8/2/2017 8:56 AM, Michael B. Klein wrote:
> SCALE DOWN
> 1) Call admin/collections?action=BACKUP for each collection to a
> shared NFS volume
> 2) Shut down all the nodes
>
> SCALE UP
> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
> live_nodes
> 3) Call admin/collections?action=RESTORE to put all the collections back
>
> This has been working very well, for the most part, with the following
> complications/observations:
>
> 1) If I don't optimize each collection right before BACKUP, the backup
> fails (see the attached solr_backup_error.json).

Sounds like you're being hit by this at backup time:

https://issues.apache.org/jira/browse/SOLR-9120

There's a patch in the issue which I have not verified and tested. The
workaround of optimizing the collection is not one I would have thought
of.
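For reference, the sequence you're describing would look roughly like
this. The collection name, backup name, host/port, and the
/mnt/solr_backups location are placeholders, and the
replicationFactor/maxShardsPerNode values just mirror what you mention
below (3 Solr nodes, one shard per node) -- I have not tested these
exact commands:

  # Optimize (forced merge) before the backup -- the workaround above:
  curl "http://localhost:8983/solr/mycoll/update?optimize=true"

  # Back up the collection to the shared NFS volume:
  curl "http://localhost:8983/solr/admin/collections?action=BACKUP&name=mycoll-backup&collection=mycoll&location=/mnt/solr_backups"

  # Restore it later, explicitly setting replicationFactor and maxShardsPerNode:
  curl "http://localhost:8983/solr/admin/collections?action=RESTORE&name=mycoll-backup&collection=mycoll&location=/mnt/solr_backups&replicationFactor=3&maxShardsPerNode=1"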
> 2) If I don't specify a replicationFactor during RESTORE, the admin
> interface's Cloud diagram only shows one active node per collection.
> Is this expected? Am I required to specify the replicationFactor
> unless I'm using a shared HDFS volume for solr data?

The documentation for RESTORE (looking at the 6.6 docs) says that the
restored collection will have the same number of shards and replicas as
the original collection. Your experience says that either the
documentation is wrong or the version of Solr you're running doesn't
behave that way, and might have a bug.

> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a
> warning message in the response, even though the restore seems to
> succeed.

I would like to see that warning, including whatever stacktrace is
present. It might be expected, but I'd like to look into it.

> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I
> do not currently have any replication stuff configured (as it seems I
> should not).

Correct, you don't need any replication configured. It's not for cloud
mode.

> 5) At the time my "1-in-3 requests are failing" issue occurred, the
> Cloud diagram looked like the attached solr_admin_cloud_diagram.png.
> It seemed to think all replicas were live and synced and happy, and
> because I was accessing solr through a round-robin load balancer, I
> was never able to tell which node was out of sync.
>
> If it happens again, I'll make node-by-node requests and try to figure
> out what's different about the failing one. But the fact that this
> happened (and the way it happened) is making me wonder if/how I can
> automate this staging environment scaling reliably and with confidence
> that it will Just Work™.

That image didn't make it to the mailing list. Your JSON showing errors
did, though. Your description of the diagram is good -- sounds like it
was all green and looked exactly how you expected it to look.

What you've described sounds like there may be a problem in the RESTORE
action on the collections API, or possibly a problem with your shared
storage where you put the backups, so the restored data on one replica
isn't faithful to the backup. I don't know very much about that code,
and what you've described makes me think that this is going to be a
hard one to track down.

Thanks,
Shawn