I am working on an autoscaling Kubernetes cluster for SolrCloud running Solr 
7.5.  Most of it is up and working, but I ran into a few issues once I got to 
testing.  The core of it is that when Solr replaces a replica, it creates the 
replacement as NRT rather than TLOG, and it does not respect the cluster policy 
when selecting the node for that replica.

######### SETUP #######
I am creating the collection with all TLOG replicas for now.  We have 6 nodes, 
2 in each of 3 availability zones.

The relevant autoscaling settings, followed by the .system collection, are:
curl -X POST http://localhost:32080/v2/cluster/autoscaling -d '{
  "set-cluster-preferences" : [
    {"minimize" : "cores"},
    {"maximize" : "freedisk", "precision" : 10},
    {"minimize" : "sysLoadAvg"}]}'

curl -X POST http://localhost:32080/v2/cluster/autoscaling -d '{
  "set-trigger" : {
    "name" : "node_added_trigger",
    "event" : "nodeAdded",
    "waitFor" : "1m",
    "enabled" : true,
    "actions" : [
      {"name" : "compute_plan", "class" : "solr.ComputePlanAction"},
      {"name" : "execute_plan", "class" : "solr.ExecutePlanAction"}]}}'

curl -X POST http://localhost:32080/v2/cluster/autoscaling -d '{
  "set-cluster-policy" : [
    {"shard" : "#ANY", "replica" : "<3", "sysprop.K8SNODE" : "*"},
    {"shard" : "#ANY", "replica" : "<3", "sysprop.EC2AZ" : "*"}]}'

curl 'http://localhost:32080/solr/admin/collections?action=CREATE&name=.system&numShards=1&autoAddReplicas=true&tlogReplicas=3&nrtReplicas=0'


The K8SNODE and EC2AZ are passed in via -D args at start time.
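
Since the policy depends on those two properties, it may be worth confirming 
that each Solr node actually reports them.  The following is only a sketch: it 
assumes the node-level system properties handler (/solr/admin/info/properties) 
is reachable at the same host and port used elsewhere in this message, and the 
function names and the RUN switch are made up for illustration.

```shell
#!/bin/sh
# Sketch only: SOLR_BASE mirrors the host/port used in this message (assumption).
SOLR_BASE="${SOLR_BASE:-http://localhost:32080/solr}"

# Build the URL of the per-node system properties handler.
props_url() {
  printf '%s/admin/info/properties?wt=json&indent=true\n' "$SOLR_BASE"
}

# Keep only the two properties that the cluster policy refers to.
filter_policy_props() {
  grep -E '"(K8SNODE|EC2AZ)"'
}

echo "Would query: $(props_url)"

# Set RUN=1 to actually send the request; repeat against each node's address,
# because the handler reports the properties of the node that answers.
if [ "${RUN:-0}" = "1" ]; then
  curl -s "$(props_url)" | filter_policy_props
fi
```

If a node prints nothing for K8SNODE or EC2AZ, the policy has no value to 
match against on that node.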

The collection in question is created as:
curl 'http://localhost:32080/solr/admin/collections?action=CREATE&name=testing.v1&collection.configName=testing.v1&tlogReplicas=6&numShards=2&autoAddReplicas=true&nrtReplicas=0'


##### Induce failure and cause issue in question #####

The collection is created as expected, except that the cluster policy is 
ignored.  We then delete a pair of nodes and let Solr notice and recreate the 
replicas that lived on those nodes.  When we bring up a pair of new nodes, Solr 
notices and moves the new replicas onto them.  All of that works exactly as it 
should, except for the replica TYPE: the replacements are NRT, not TLOG.  
Screenshot of the node tree:
https://monosnap.com/file/jGhVbfB1Aa5HpghcGMy6sh8Oe99Hjx
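
The same information can be pulled as text instead of a screenshot.  This is a 
minimal sketch, assuming the Collections API CLUSTERSTATUS response lists a 
"type" for every replica in the same compact "key":"value" form seen in the 
log excerpts in this message; the function names and the RUN switch are 
illustrative only.

```shell
#!/bin/sh
# Sketch only: SOLR_BASE mirrors the host/port used in this message (assumption).
SOLR_BASE="${SOLR_BASE:-http://localhost:32080/solr}"

# Build the CLUSTERSTATUS URL for one collection ($1).
cluster_status_url() {
  printf '%s/admin/collections?action=CLUSTERSTATUS&collection=%s&wt=json\n' \
    "$SOLR_BASE" "$1"
}

# Read JSON on stdin and print one TYPE=count line per replica type found.
count_replica_types() {
  grep -o '"type":"[A-Z]*"' | cut -d'"' -f4 | sort | uniq -c |
    awk '{print $2 "=" $1}'
}

echo "Would query: $(cluster_status_url testing.v1)"

# Set RUN=1 to actually send the request.
if [ "${RUN:-0}" = "1" ]; then
  curl -s "$(cluster_status_url testing.v1)" | count_replica_types
fi
```

Running it before and after the node replacement would show whether the TLOG 
count drops while the NRT count rises.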


Log entry is the same for the .system and testing.v1 collections:


createReplica() {
  "operation":"ADDREPLICA",
  "collection":".system",
  "shard":"shard1",
  "core":".system_shard1_replica_n1",
  "state":"down",
  "base_url":"http://172.28.149.41:32080/solr",
  "type":"NRT",
  "waitForFinalState":"false"}
2018-11-05 15:05:46.170 INFO  (OverseerStateUpdate-245134345588899840-172.28.151.122:32080_solr-n_0000000000) [   ] o.a.s.c.o.SliceMutator createReplica() {
  "operation":"ADDREPLICA",
  "collection":".system",
  "shard":"shard1",
  "core":".system_shard1_replica_n3",
  "state":"down",
  "base_url":"http://172.28.154.245:32080/solr",
  "type":"NRT",
  "waitForFinalState":"false"}
2018-11-05 15:05:46.194 INFO  (OverseerStateUpdate-245134345588899840-172.28.151.122:32080_solr-n_0000000000) [   ] o.a.s.c.o.SliceMutator createReplica() {
  "operation":"ADDREPLICA",
  "collection":".system",
  "shard":"shard1",
  "core":".system_shard1_replica_n5",
  "state":"down",
  "base_url":"http://172.28.156.38:32080/solr",
  "type":"NRT",
  "waitForFinalState":"false"}



I am struggling to find anything in the docs that explains how to tell the 
cluster what ratio of TLOG and/or PULL replicas to maintain as it moves things 
around.  Either way, if it is replacing a TLOG replica, it should replace it 
with another TLOG replica, right?
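
For comparison, a replica can be added by hand with an explicit type, because 
the Collections API ADDREPLICA action accepts a type parameter (nrt, tlog or 
pull).  The sketch below only builds that request; the node name is derived 
from one of the base_url values in the log above, the function name and RUN 
switch are illustrative, and nothing in this message establishes whether the 
call succeeds while a cluster policy is in place.

```shell
#!/bin/sh
# Sketch only: SOLR_BASE mirrors the host/port used in this message (assumption).
SOLR_BASE="${SOLR_BASE:-http://localhost:32080/solr}"

# Build an ADDREPLICA URL that requests a TLOG replica explicitly.
#   $1 = collection, $2 = shard, $3 = target node name (host:port_solr)
add_tlog_replica_url() {
  printf '%s/admin/collections?action=ADDREPLICA&collection=%s&shard=%s&node=%s&type=tlog\n' \
    "$SOLR_BASE" "$1" "$2" "$3"
}

# Example arguments; the node name is taken from the log excerpt and is
# specific to that cluster.
url=$(add_tlog_replica_url testing.v1 shard1 172.28.149.41:32080_solr)
echo "Would request: $url"

# Set RUN=1 to actually send the request.
if [ "${RUN:-0}" = "1" ]; then
  curl -s "$url"
fi
```

Removing the unwanted NRT replica afterwards would be a separate DELETEREPLICA 
call.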

######### Cluster Policy Issue ############
The next issue is the cluster policy.  The goal is to make sure that a given 
replica is not duplicated on a physical Kubernetes node (K8SNODE) and that we 
keep 2 copies of each shard in each availability zone.  As far as I can tell, 
these rules are simply being ignored.  If I instead create rules on the 
collection itself, placement works as expected:

curl 'http://localhost:32080/solr/admin/collections?action=CREATE&name=testing.v1&collection.configName=testing.v1&maxShardsPerNode=32&numShards=2&replicationFactor=6&autoAddReplicas=true&rule=shard:*,replica:<2,sysprop.K8SNODE:*&rule=shard:*,replica:>1,sysprop.EC2AZ:*'

But then I cannot use TLOG replicas, because when I try to create the 
collection with TLOG replicas and those rules in place, it complains:

{
  "responseHeader":{
    "status":400,
    "QTime":134},
  "Operation create caused exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: TLOG or PULL replica types not supported with placement rules or cluster policies",
  "exception":{
    "msg":"TLOG or PULL replica types not supported with placement rules or cluster policies",
    "rspCode":400},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"TLOG or PULL replica types not supported with placement rules or cluster policies",
    "code":400}}
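
To see what Solr itself thinks of the cluster policy that appears to be 
ignored, the stored autoscaling configuration and the diagnostics report can 
be read back.  This is a sketch under assumptions: it reuses the autoscaling 
URL from the setup commands above and assumes the diagnostics endpoint 
described in the Solr autoscaling API docs is exposed under the same prefix; 
the function name and the RUN switch are illustrative.

```shell
#!/bin/sh
# Sketch only: same autoscaling endpoint as the setup commands (assumption
# that a diagnostics sub-path is reachable under it).
AUTOSCALING_URL="${AUTOSCALING_URL:-http://localhost:32080/v2/cluster/autoscaling}"

# Build the diagnostics URL from the autoscaling base URL.
diagnostics_url() {
  printf '%s/diagnostics\n' "$AUTOSCALING_URL"
}

echo "Config:      $AUTOSCALING_URL"
echo "Diagnostics: $(diagnostics_url)"

# Set RUN=1 to actually send the requests.
if [ "${RUN:-0}" = "1" ]; then
  # The stored preferences, triggers and cluster policy.
  curl -s "$AUTOSCALING_URL"
  # The node ordering and any policy violations Solr currently computes.
  curl -s "$(diagnostics_url)"
fi
```

An empty list of violations alongside a placement that breaks the intended 
rules would suggest the policy does not say what was intended, rather than 
that it is being skipped.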



Also of note, the docs do not call out in any meaningful way whether you can 
or cannot use TLOG or PULL replicas with placement rules or cluster policies.  
The docs DO call out that mixing NRT with TLOG replicas is not supported, which 
would seem to conflict with this restriction.

Please let me know what additional information would be helpful in getting to 
the root cause on this.

Thank you,
Matthew