I have just tried this with Solr 7.4 and am getting the same symptoms as I describe below.
The symptoms I am seeing are quite different from what I gather Shawn Heisey understood them to be, so I will describe them again.

Let us assume that we start with a SolrCloud of two nodes: one at hostname1:9999 and the other at hostname2:9999. Let us also assume that we have a one-shard collection with two replicas: one replica is on the node at hostname1:9999 (with the core col_shard1_replica_n1) and the other is on the node at hostname2:9999 (with the core col_shard1_replica_n3).

Then I run SPLITSHARD. I end up with four cores instead of two, as expected. The problem is that three of the four cores (col_shard1_0_replica_n5, col_shard1_0_replica0 and col_shard1_1_replica_n6) are *all on hostname1*. Only col_shard1_1_replica0 was placed on hostname2.

Prior to the SPLITSHARD, if hostname1 becomes temporarily unavailable, the SolrCloud can still be used: hostname2 has all the data. After the SPLITSHARD, if hostname1 becomes temporarily unavailable, the SolrCloud has no access at all to the data in shard1_0.

Granted, I could add a replica of shard1_0 onto hostname2, and I could then drop one of the extraneous shard1_0 replicas on hostname1; but I don't see the logic in requiring such additional steps every time.

My question is: how can I tell Solr "avoid putting two replicas of the same shard on the same node"? (Sketches of both the manual workaround and a possible placement rule follow the quoted message below.)

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Tuesday, June 19, 2018 2:20 PM
To: solr-user@lucene.apache.org
Subject: Re: sharding and placement of replicas

On 6/15/2018 11:08 AM, Oakley, Craig (NIH/NLM/NCBI) [C] wrote:
> If I start with a collection X on two nodes with one shard and two replicas
> (for redundancy, in case a node goes down): a node on host1 has
> X_shard1_replica1 and a node on host2 has X_shard1_replica2: when I try
> SPLITSHARD, I generally get X_shard1_0_replica1, X_shard1_1_replica1 and
> X_shard1_0_replica0 all on the node on host1 with X_shard1_1_replica0 sitting
> alone on the node on host2. If host1 were to go down at this point, shard1_0
> would be unavailable.

https://lucene.apache.org/solr/guide/6_6/collections-api.html#CollectionsAPI-splitshard

That documentation says "The new shards will have as many replicas as the original shard."

That tells me that what you're seeing is not matching the *intent* of the SPLITSHARD feature. The fact that you get *one* of the new shards but not the other is suspicious. I'm wondering if maybe Solr tried to create it but had a problem doing so. Can you check for errors in the solr logfile on host2?

If there's nothing about your environment that would cause a failure to create the replica, then it might be a bug.

> Is there a way either of specifying placement or of giving hints that
> replicas ought to be separated?

It shouldn't be necessary to give Solr any parameters for that. All nodes where the shard exists should get copies of the new shards when you split it.

> I am currently running Solr6.6.0, if that is relevant.

If this is a provable and reproducible bug, and it's still a problem in the current stable branch (next release from that will be 7.4.0), then it will definitely be fixed. If it's only a problem in 6.x, then I can't guarantee that it will be fixed. That's because the 6.x line is in maintenance mode, which means that there's a very high bar for changes. In most cases, only changes that meet one of these criteria are made in maintenance mode:

 * Fixes a security bug.
 * Fixes a MAJOR bug with no workaround.
 * Fix is a very trivial code change and not likely to introduce new bugs.

Of those criteria, generally only the first two are likely to prompt an actual new software release. If enough changes of the third type accumulate, that might prompt a new release.

My personal opinion: If this is a general problem in 6.x, it should be fixed there. Because there is a workaround, it would not be cause for an immediate new release.

Thanks,
Shawn
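For completeness, the manual workaround I mentioned above (add a replica of shard1_0 on hostname2, then drop one of the extra shard1_0 replicas on hostname1) would look roughly like the following Collections API calls. This is an untested sketch: the node-name suffix "_solr" and the replica name "core_node5" are placeholders and would need to match whatever CLUSTERSTATUS actually reports for the collection.

  curl "http://hostname1:9999/solr/admin/collections?action=ADDREPLICA&collection=col&shard=shard1_0&node=hostname2:9999_solr"

  curl "http://hostname1:9999/solr/admin/collections?action=DELETEREPLICA&collection=col&shard=shard1_0&replica=core_node5"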
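As for the underlying question of telling Solr "avoid putting two replicas of the same shard on the same node": as far as I can tell, this is what the rule-based replica placement (6.x) and the autoscaling cluster policy (7.x) are intended for, although whether SPLITSHARD honors such a constraint is exactly what is in question here. If I am reading the 7.x autoscaling documentation correctly, a minimal sketch would be something like:

  curl -H 'Content-Type: application/json' \
    "http://hostname1:9999/solr/admin/autoscaling" \
    -d '{"set-cluster-policy": [{"replica": "<2", "shard": "#EACH", "node": "#ANY"}]}'

which is meant to say "no node may hold more than one replica of any given shard". On 6.6, the rough equivalent appears to be the rule parameter at collection-creation time, e.g. rule=shard:*,replica:<2,node:*. I have not yet verified that either form changes the placement chosen by SPLITSHARD.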