On 8/2/2017 8:56 AM, Michael B. Klein wrote:
> SCALE DOWN
> 1) Call admin/collections?action=BACKUP for each collection to a
> shared NFS volume
> 2) Shut down all the nodes
>
> SCALE UP
> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
> live_nodes
> 3) Call admin/collections?action=RESTORE to put all the collections back
>
> This has been working very well, for the most part, with the following
> complications/observations:
>
> 1) If I don't optimize each collection right before BACKUP, the backup
> fails (see the attached solr_backup_error.json).

Sounds like you're being hit by this at backup time:

https://issues.apache.org/jira/browse/SOLR-9120

There's a patch attached to that issue which I have not verified or
tested.  The workaround of optimizing the collection right before
BACKUP is not one I would have thought of.
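
If you want to keep that workaround in place until SOLR-9120 is
resolved, the sequence I'd expect is roughly the following.  This is an
untested sketch; the base URL, collection names, and NFS path are all
placeholders for your setup.

import requests

SOLR_URL = "http://localhost:8983/solr"       # placeholder
COLLECTIONS = ["collection1", "collection2"]  # placeholder
BACKUP_LOCATION = "/mnt/solr_backups"         # the shared NFS volume

for coll in COLLECTIONS:
    # Optimize first -- the workaround for the BACKUP failure
    r = requests.get(f"{SOLR_URL}/{coll}/update",
                     params={"optimize": "true", "waitSearcher": "true"})
    r.raise_for_status()
    # Then back the collection up to the shared location
    r = requests.get(f"{SOLR_URL}/admin/collections",
                     params={"action": "BACKUP", "wt": "json",
                             "name": f"{coll}_backup",
                             "collection": coll,
                             "location": BACKUP_LOCATION})
    r.raise_for_status()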

> 2) If I don't specify a replicationFactor during RESTORE, the admin
> interface's Cloud diagram only shows one active node per collection.
> Is this expected? Am I required to specify the replicationFactor
> unless I'm using a shared HDFS volume for solr data?

The documentation for RESTORE (looking at the 6.6 docs) says that the
restored collection will have the same number of shards and replicas as
the original collection.  Your experience suggests that either the
documentation is wrong or the version of Solr you're running doesn't
behave as documented, which would be a bug.
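
Until that's pinned down, explicitly passing replicationFactor (and
maxShardsPerNode, which also touches on your point 3 below) on the
RESTORE call seems like a reasonable workaround.  Another untested
sketch, with the same placeholders as above:

import requests

SOLR_URL = "http://localhost:8983/solr"       # placeholder
COLLECTIONS = ["collection1", "collection2"]  # placeholder
BACKUP_LOCATION = "/mnt/solr_backups"         # shared NFS volume

for coll in COLLECTIONS:
    r = requests.get(f"{SOLR_URL}/admin/collections",
                     params={"action": "RESTORE", "wt": "json",
                             "name": f"{coll}_backup",
                             "collection": coll,
                             "location": BACKUP_LOCATION,
                             "replicationFactor": 3,
                             "maxShardsPerNode": 1})
    r.raise_for_status()
    # Keep the full response so any warning text can be shared here
    print(coll, r.json())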

> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a
> warning message in the response, even though the restore seems to succeed.

I would like to see that warning, including whatever stacktrace is
present.  It might be expected, but I'd like to look into it.

> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I
> do not currently have any replication stuff configured (as it seems I
> should not).

Correct, you don't need any replication configured.  That's for
master-slave setups, not cloud mode; SolrCloud handles replication
between replicas on its own.

> 5) At the time my "1-in-3 requests are failing" issue occurred, the
> Cloud diagram looked like the attached solr_admin_cloud_diagram.png.
> It seemed to think all replicas were live and synced and happy, and
> because I was accessing solr through a round-robin load balancer, I
> was never able to tell which node was out of sync.
>
> If it happens again, I'll make node-by-node requests and try to figure
> out what's different about the failing one. But the fact that this
> happened (and the way it happened) is making me wonder if/how I can
> automate this staging environment scaling reliably and with
> confidence that it will Just Work™.

That image didn't make it to the mailing list.  Your JSON showing errors
did, though.  Your description of the diagram is good -- sounds like it
was all green and looked exactly how you expected it to look.

What you've described sounds like there may be a problem in the RESTORE
action of the collections API, or possibly a problem with the shared
storage where you put the backups, such that the restored data on one
replica isn't faithful to the backup.  I don't know very much about
that code, and this sounds like it's going to be a hard one to track
down.
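
When you do make those node-by-node requests, sending the same query to
each node with distrib=false should narrow it down, since that keeps
the query on the local replica instead of fanning it out across the
cloud.  One more untested sketch; the node addresses and collection
name are placeholders:

import requests

NODES = ["http://solr1:8983", "http://solr2:8983",
         "http://solr3:8983"]                 # placeholders
COLLECTION = "collection1"                    # placeholder

for node in NODES:
    r = requests.get(f"{node}/solr/{COLLECTION}/select",
                     params={"q": "*:*", "rows": 0,
                             "distrib": "false", "wt": "json"})
    r.raise_for_status()
    print(node, "numFound =", r.json()["response"]["numFound"])

A node whose numFound differs from the others would be the one to dig
into.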

Thanks,
Shawn
