And the one that isn't getting the updates is the one marked in the cloud diagram as the leader.
/me bangs head on desk

On Wed, Aug 2, 2017 at 10:31 AM, Michael B. Klein <mbkl...@gmail.com> wrote:
> Another observation: After bringing the cluster back up just now, the
> "1-in-3 nodes don't get the updates" issue persists, even with the cloud
> diagram showing 3 nodes, all green.
>
> On Wed, Aug 2, 2017 at 9:56 AM, Michael B. Klein <mbkl...@gmail.com> wrote:
>
>> Thanks for your responses, Shawn and Erick.
>>
>> Some clarification questions, but first a description of my
>> (non-standard) use case:
>>
>> My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are
>> working well so far on the production cluster (knock wood); it's the
>> staging cluster that's giving me fits. Here's why: In order to save
>> money, I have the AWS auto-scaler scale the cluster down to zero nodes
>> when it's not in use. Here's the (automated) procedure:
>>
>> SCALE DOWN
>> 1) Call admin/collections?action=BACKUP for each collection to a shared
>>    NFS volume
>> 2) Shut down all the nodes
>>
>> SCALE UP
>> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
>> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
>>    live_nodes
>> 3) Call admin/collections?action=RESTORE to put all the collections back
>>
>> This has been working very well, for the most part, with the following
>> complications/observations:
>>
>> 1) If I don't optimize each collection right before BACKUP, the backup
>>    fails (see the attached solr_backup_error.json).
>> 2) If I don't specify a replicationFactor during RESTORE, the admin
>>    interface's Cloud diagram only shows one active node per collection.
>>    Is this expected? Am I required to specify the replicationFactor
>>    unless I'm using a shared HDFS volume for solr data?
>> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
>>    message in the response, even though the restore seems to succeed.
>> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I
>>    do not currently have any replication stuff configured (as it seems I
>>    should not).
>> 5) At the time my "1-in-3 requests are failing" issue occurred, the
>>    Cloud diagram looked like the attached solr_admin_cloud_diagram.png.
>>    It seemed to think all replicas were live and synced and happy, and
>>    because I was accessing Solr through a round-robin load balancer, I
>>    was never able to tell which node was out of sync.
>>
>> If it happens again, I'll make node-by-node requests and try to figure
>> out what's different about the failing one. But the fact that this
>> happened (and the way it happened) is making me wonder if/how I can
>> automate this staging environment scaling reliably and with confidence
>> that it will Just Work™.
>>
>> Comments and suggestions would be GREATLY appreciated.
>>
>> Michael
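For reference, the BACKUP/RESTORE cycle described above amounts to a pair of
Collections API calls per collection, along the lines of the sketch below; the
host, collection name, backup name and NFS mount point here are placeholders,
not the actual values from this cluster:

    # back up one collection to the shared NFS volume before scaling down
    curl "http://localhost:8983/solr/admin/collections?action=BACKUP&name=mycoll-backup&collection=mycoll&location=/mnt/solr-backups"

    # restore it after the new nodes show up under live_nodes
    curl "http://localhost:8983/solr/admin/collections?action=RESTORE&name=mycoll-backup&collection=mycoll&location=/mnt/solr-backups&replicationFactor=3&maxShardsPerNode=1"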
>> On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> And please do not use optimize unless your index is
>>> totally static. I only recommend it when the pattern is
>>> to update the index periodically, like every day or
>>> something, and not update any docs in between times.
>>>
>>> Implied in Shawn's e-mail was that you should undo
>>> anything you've done in terms of configuring replication;
>>> just go with the defaults.
>>>
>>> Finally, my bet is that your problematic Solr node is misconfigured.
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>>> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>>> >> I have a 3-node SolrCloud cluster orchestrated by Zookeeper. Most
>>> >> stuff seems to be working OK, except that one of the nodes never
>>> >> seems to get its replica updated.
>>> >>
>>> >> Queries take place through a non-caching, round-robin load balancer.
>>> >> The collection looks fine, with one shard and a replicationFactor
>>> >> of 3. Everything in the cloud diagram is green.
>>> >>
>>> >> But if I (for example) select?q=id:hd76s004z, the results come up
>>> >> empty 1 out of every 3 times.
>>> >>
>>> >> Even several minutes after a commit and optimize, one replica still
>>> >> isn't returning the right info.
>>> >>
>>> >> Do I need to configure my `solrconfig.xml` with `replicateAfter`
>>> >> options on the `/replication` requestHandler, or is that a
>>> >> non-SolrCloud, standalone-replication thing?
>>> >
>>> > This is one of the more confusing aspects of SolrCloud.
>>> >
>>> > When everything is working perfectly in a SolrCloud install, the
>>> > feature in Solr called "replication" is *never* used. SolrCloud does
>>> > require the replication feature, though ... which is what makes this
>>> > whole thing very confusing.
>>> >
>>> > Replication is used to replicate an entire Lucene index (consisting
>>> > of a bunch of files on the disk) from a core on a master server to a
>>> > core on a slave server. This is how replication was done before
>>> > SolrCloud was created.
>>> >
>>> > The way that SolrCloud keeps replicas in sync is *entirely*
>>> > different. SolrCloud has no masters and no slaves. When you index or
>>> > delete a document in a SolrCloud collection, the request is forwarded
>>> > to the leader of the correct shard for that document. The leader then
>>> > sends a copy of that request to all the other replicas, and each
>>> > replica (including the leader) independently handles the updates that
>>> > are in the request. Since all replicas index the same content, they
>>> > stay in sync.
>>> >
>>> > What SolrCloud does with the replication feature is index recovery.
>>> > In some situations recovery can be done from the leader's transaction
>>> > log, but when a replica has gotten so far out of sync that the only
>>> > option available is to completely replace the index on the bad
>>> > replica, SolrCloud will fire up the replication feature and create an
>>> > exact copy of the index from the replica that is currently elected as
>>> > leader. SolrCloud temporarily designates the leader core as master
>>> > and the bad replica as slave, then initiates a one-time replication.
>>> > This is all completely automated and requires no configuration or
>>> > input from the administrator.
>>> >
>>> > The configuration elements you have asked about are for the old
>>> > master-slave replication setup and do not apply to SolrCloud at all.
>>> >
>>> > What I would recommend that you do to solve your immediate issue:
>>> > Shut down the Solr instance that is having the problem, rename the
>>> > "data" directory in the core that isn't working right to something
>>> > else, and start Solr back up. As long as you still have at least one
>>> > good replica in the cloud, SolrCloud will see that the index data is
>>> > gone and copy the index from the leader. You could delete the data
>>> > directory instead of renaming it, but that would leave you with no
>>> > "undo" option.
>>> >
>>> > Thanks,
>>> > Shawn
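One way to identify which of the three replicas is stale, rather than guessing
through the round-robin load balancer, is to query each node directly with
distrib=false so the request stays on that node's local core. A rough sketch,
assuming the nodes are reachable as solr-node-1 through solr-node-3 and the
collection is named mycoll (both placeholders):

    for host in solr-node-1 solr-node-2 solr-node-3; do
      echo -n "$host: "
      # distrib=false keeps the query on the local replica instead of fanning out
      curl -s "http://$host:8983/solr/mycoll/select?q=id:hd76s004z&distrib=false&rows=0&wt=json" \
        | grep -o '"numFound":[0-9]*'
    done

The node that answers with "numFound":0 while the others report 1 is the
replica that the rename-the-data-directory recovery step would apply to.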