[SolrCloud] shard hash ranges changed after restoring backup

2016-06-15, Gary Yao
Hi all,

My team at work maintains a SolrCloud 5.3.2 cluster with multiple
collections configured with sharding and replication.

We recently backed up our Solr indexes using the built-in backup
functionality. After the cluster was restored from the backup, we
noticed that atomic updates of documents occasionally fail with the
error message 'missing required field [...]'. The exceptions are
thrown on hosts that do not store the document being updated. From
this we deduce that the wrong host is being selected based on the hash
of the uniqueKey. Indeed, our investigation so far shows that for at
least one collection in the new cluster, the shards now have different
hash ranges assigned. We checked the hash ranges by querying
/admin/collections?action=CLUSTERSTATUS; a short script for pulling
the ranges is sketched after the listings below. These are the shard
hash ranges of one collection that we debugged.

  Old cluster:
shard1_0 8000 - aaa9
shard1_1 aaaa - d554
shard2_0 d555 - fffe
shard2_1 ffff - 2aa9
shard3_0 2aaa - 5554
shard3_1 5555 - 7fff

  New cluster:
shard1 8000 - aaa9
shard2 aaaa - d554
shard3 d555 - ffff
shard4 0000 - 2aa9
shard5 2aaa - 5554
shard6 5555 - 7fff

  Note that the shard names differ because the old cluster's shards were
  split.
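
For reference, the ranges can be pulled out of CLUSTERSTATUS with a
few lines of Python; the host and collection name below are
placeholders for ours.

  import json
  import urllib.request

  # Any node in the cluster answers the Collections API call.
  url = ("http://localhost:8983/solr/admin/collections"
         "?action=CLUSTERSTATUS&collection=mycollection&wt=json")
  status = json.load(urllib.request.urlopen(url))

  shards = status["cluster"]["collections"]["mycollection"]["shards"]
  for name in sorted(shards):
      print(name, shards[name]["range"])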

As you can see, the ranges of shard3 and shard4 differ from their old
counterparts: shard2_0 ended at fffe and shard2_1 started at ffff,
whereas shard3 now ends at ffff and shard4 starts at 0000. This change
of hash ranges matches the symptoms we are currently experiencing.
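
To make the failure mode concrete: for a plain uniqueKey (one without
'!' composite routing), Solr hashes the key with MurmurHash3_x86_32
and sends the document to the shard whose range contains the result.
Below is a minimal sketch of that lookup; it assumes the mmh3 Python
package, which implements the same hash function, and pads the 16-bit
prefixes above out to full 32-bit bounds.

  import mmh3  # assumption: pip install mmh3

  def signed(x):
      # Solr stores the range bounds as signed 32-bit integers.
      return x - (1 << 32) if x >= (1 << 31) else x

  # The old cluster's ranges from the listing above.
  SHARDS = {
      "shard1_0": (signed(0x80000000), signed(0xaaa9ffff)),
      "shard1_1": (signed(0xaaaa0000), signed(0xd554ffff)),
      "shard2_0": (signed(0xd5550000), signed(0xfffeffff)),
      "shard2_1": (signed(0xffff0000), signed(0x2aa9ffff)),
      "shard3_0": (signed(0x2aaa0000), signed(0x5554ffff)),
      "shard3_1": (signed(0x55550000), signed(0x7fffffff)),
  }

  def shard_for(doc_id):
      # Signed 32-bit MurmurHash3_x86_32 with seed 0, as Solr uses
      # for a plain uniqueKey.
      h = mmh3.hash(doc_id)
      for name, (lo, hi) in SHARDS.items():
          if lo <= h <= hi:
              return name
      raise ValueError("no shard covers hash %08x" % (h & 0xffffffff))

  print(shard_for("example-doc-id"))

A key whose hash lands in ffff0000-ffffffff belonged to shard2_1 under
the old ranges but routes to shard3 under the new ones, which would
explain the sporadic errors: the atomic update is sent to a shard that
does not hold the document and is processed there as a brand-new,
incomplete document.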

We found this JIRA ticket https://issues.apache.org/jira/browse/SOLR-5750
in which David Smiley comments:

  shard hash ranges aren't restored; this error could be disastrous

It seems that this is what happened to us. We would like to hear some
suggestions on how we could recover from this problem.

Best,
Gary


Re: [SolrCloud] shard hash ranges changed after restoring backup

2016-06-16, Gary Yao
Hi Erick,

I should add that our Solr cluster is in production and new documents
are constantly being indexed. The new cluster has been up for three
weeks now. The problem was discovered only now because in our use case
Atomic Updates and RealTime Gets are mostly performed on recent
documents. It is almost certain that the index already contains
documents that were distributed to the shards according to the new
hash ranges. If we just changed the hash ranges in ZooKeeper, the
index would still be in an inconsistent state.
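
To gauge the extent of the inconsistency, one could sample each shard
and flag documents whose hash does not fall inside the range the shard
currently claims. A sketch, again assuming the mmh3 package, with the
host and collection as placeholders and the full 32-bit bounds filled
in from the new cluster's listing:

  import json
  import urllib.request
  import mmh3  # assumption: pip install mmh3

  BASE = "http://localhost:8983/solr/mycollection/select"  # placeholder

  def signed(x):
      return x - (1 << 32) if x >= (1 << 31) else x

  # The new cluster's ranges, i.e. what CLUSTERSTATUS reports today.
  RANGES = {
      "shard1": (signed(0x80000000), signed(0xaaa9ffff)),
      "shard2": (signed(0xaaaa0000), signed(0xd554ffff)),
      "shard3": (signed(0xd5550000), signed(0xffffffff)),
      "shard4": (signed(0x00000000), signed(0x2aa9ffff)),
      "shard5": (signed(0x2aaa0000), signed(0x5554ffff)),
      "shard6": (signed(0x55550000), signed(0x7fffffff)),
  }

  for name, (lo, hi) in RANGES.items():
      # 'shards' restricts the query to one logical shard; rows=1000
      # is only a sample, a full audit would page with cursorMark.
      url = BASE + "?q=*:*&fl=id&rows=1000&wt=json&shards=" + name
      docs = json.load(urllib.request.urlopen(url))["response"]["docs"]
      for doc in docs:
          h = mmh3.hash(doc["id"])
          if not (lo <= h <= hi):
              print("misplaced: %s on %s (hash %08x)"
                    % (doc["id"], name, h & 0xffffffff))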

Is there any way to recover from this without having to re-index all
documents?

Best,
Gary

2016-06-15 19:23 GMT+02:00 Erick Erickson:
> Simplest, though a bit risky, is to manually edit the znode and
> correct the entry. There are various tools out there, including
> one that ships with ZooKeeper (see the ZK documentation).
>
> Or you can use the zkcli scripts (the ZooKeeper ones) to get the
> znode down to your local machine, edit it there, and then push it
> back up to ZK.
>
> I'd do all this with my Solr nodes shut down, then ensure that my
> ZK ensemble was consistent after the update, etc.
>
> Best,
> Erick
>
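
For the archives, here is a rough sketch of the znode edit Erick
describes, using the kazoo Python client. The ZK hosts, collection
name, state location, and corrected range values are all assumptions,
and as noted above this alone would not fix documents that were
already indexed under the new ranges.

  import json
  from kazoo.client import KazooClient  # assumption: pip install kazoo

  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder
  zk.start()

  # With stateFormat=2 (the default for collections created on Solr
  # 5.x) the state lives here; older collections keep it in the
  # shared /clusterstate.json instead.
  path = "/collections/mycollection/state.json"
  data, stat = zk.get(path)
  state = json.loads(data.decode("utf-8"))

  # Put the pre-backup boundaries back on the two affected shards,
  # using the values from the old cluster's listing above.
  shards = state["mycollection"]["shards"]
  shards["shard3"]["range"] = "d5550000-fffeffff"
  shards["shard4"]["range"] = "ffff0000-2aa9ffff"

  # Conditional write so a concurrent change fails loudly; do this
  # with the Solr nodes shut down, as Erick advises.
  zk.set(path, json.dumps(state).encode("utf-8"), version=stat.version)
  zk.stop()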