Hi Michael,

I also verified the patch in SOLR-14471 with 8.4.1, and it fixed the issue with shards.preference=replica.location:local,replica.type:TLOG in my setup. Thanks!
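In case it is useful for anyone else verifying the fix, a minimal SolrJ sketch along the lines below lets you toggle the preference per request instead of relying only on the solrconfig.xml default, which makes before/after comparisons easy. The ZooKeeper address and collection name are placeholders, not values from this thread.

import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardsPreferenceProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; substitute your own.
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder(List.of("zk1:2181"), Optional.empty()).build()) {
            SolrQuery q = new SolrQuery("*:*");
            // The preference string under discussion; remove this line to compare
            // against the default (shuffled) replica ordering.
            q.set("shards.preference", "replica.location:local,replica.type:TLOG");
            // "my_collection" is a placeholder collection name.
            QueryResponse rsp = client.query("my_collection", q);
            System.out.println("hits=" + rsp.getResults().getNumFound()
                + ", QTime=" + rsp.getQTime() + "ms");
        }
    }
}

Running this once with and once without the shards.preference line, while watching the per-core request counters quoted below in the thread, should show whether the internal shard requests spread out.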
Wei On Thu, May 21, 2020 at 12:09 PM Phill Campbell <sirgilli...@yahoo.com.invalid> wrote: > Yes, JVM heap settings. > > > On May 19, 2020, at 10:59 AM, Wei <weiwan...@gmail.com> wrote: > > > > Hi Phill, > > > > What is the RAM config you are referring to, JVM size? How is that > related > > to the load balancing, if each node has the same configuration? > > > > Thanks, > > Wei > > > > On Mon, May 18, 2020 at 3:07 PM Phill Campbell > > <sirgilli...@yahoo.com.invalid> wrote: > > > >> In my previous report I was configured to use as much RAM as possible. > >> With that configuration it seemed it was not load balancing. > >> So, I reconfigured and redeployed to use 1/4 the RAM. What a difference > >> for the better! > >> > >> 10.156.112.50 load average: 13.52, 10.56, 6.46 > >> 10.156.116.34 load average: 11.23, 12.35, 9.63 > >> 10.156.122.13 load average: 10.29, 12.40, 9.69 > >> > >> Very nice. > >> My tool that tests records RPS. In the “bad” configuration it was less > >> than 1 RPS. > >> NOW it is showing 21 RPS. > >> > >> > >> > http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >> { > >> "responseHeader":{ > >> "status":0, > >> "QTime":161}, > >> "metrics":{ > >> "solr.core.BTS.shard1.replica_n2":{ > >> "QUERY./select.requestTimes":{ > >> "count":5723, > >> "meanRate":6.8163888639859085, > >> "1minRate":11.557013215119536, > >> "5minRate":8.760356217628159, > >> "15minRate":4.707624230995833, > >> "min_ms":0.131545, > >> "max_ms":388.710848, > >> "mean_ms":30.300492048215947, > >> "median_ms":6.336654, > >> "stddev_ms":51.527164088667035, > >> "p75_ms":35.427943, > >> "p95_ms":140.025957, > >> "p99_ms":230.533099, > >> "p999_ms":388.710848}}}} > >> > >> > >> > >> > http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >> { > >> "responseHeader":{ > >> "status":0, > >> "QTime":11}, > >> "metrics":{ > >> "solr.core.BTS.shard2.replica_n8":{ > >> "QUERY./select.requestTimes":{ > >> "count":6469, > >> "meanRate":7.502581801189549, > >> "1minRate":12.211423085368564, > >> "5minRate":9.445681397767322, > >> "15minRate":5.216209798637846, > >> "min_ms":0.154691, > >> "max_ms":701.657394, > >> "mean_ms":34.2734699171445, > >> "median_ms":5.640378, > >> "stddev_ms":62.27649205954566, > >> "p75_ms":39.016371, > >> "p95_ms":156.997982, > >> "p99_ms":288.883028, > >> "p999_ms":538.368031}}}} > >> > >> > >> > http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >> { > >> "responseHeader":{ > >> "status":0, > >> "QTime":67}, > >> "metrics":{ > >> "solr.core.BTS.shard3.replica_n16":{ > >> "QUERY./select.requestTimes":{ > >> "count":7109, > >> "meanRate":7.787524673806184, > >> "1minRate":11.88519763582083, > >> "5minRate":9.893315557386755, > >> "15minRate":5.620178363676527, > >> "min_ms":0.150887, > >> "max_ms":472.826462, > >> "mean_ms":32.184282366621204, > >> "median_ms":6.977733, > >> "stddev_ms":55.729908615189196, > >> "p75_ms":36.655011, > >> "p95_ms":151.12627, > >> "p99_ms":251.440162, > >> "p999_ms":472.826462}}}} > >> > >> > >> Compare that to the previous report and you can see the improvement. > >> So, note to myself. 
Figure out the sweet spot for RAM usage. Use too > much > >> and strange behavior is noticed. While using too much all the load > focused > >> on one box and query times slowed. > >> I did not see any OOM errors during any of this. > >> > >> Regards > >> > >> > >> > >>> On May 18, 2020, at 3:23 PM, Phill Campbell > >> <sirgilli...@yahoo.com.INVALID> wrote: > >>> > >>> I have been testing 8.5.2 and it looks like the load has moved but is > >> still on one machine. > >>> > >>> Setup: > >>> 3 physical machines. > >>> Each machine hosts 8 instances of Solr. > >>> Each instance of Solr hosts one replica. > >>> > >>> Another way to say it: > >>> Number of shards = 8. Replication factor = 3. > >>> > >>> Here is the cluster state. You can see that the leaders are well > >> distributed. > >>> > >>> {"TEST_COLLECTION":{ > >>> "pullReplicas":"0", > >>> "replicationFactor":"3", > >>> "shards":{ > >>> "shard1":{ > >>> "range":"80000000-9fffffff", > >>> "state":"active", > >>> "replicas":{ > >>> "core_node3":{ > >>> "core":"TEST_COLLECTION_shard1_replica_n1", > >>> "base_url":"http://10.156.122.13:10007/solr", > >>> "node_name":"10.156.122.13:10007_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node5":{ > >>> "core":"TEST_COLLECTION_shard1_replica_n2", > >>> "base_url":"http://10.156.112.50:10002/solr", > >>> "node_name":"10.156.112.50:10002_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false", > >>> "leader":"true"}, > >>> "core_node7":{ > >>> "core":"TEST_COLLECTION_shard1_replica_n4", > >>> "base_url":"http://10.156.112.50:10006/solr", > >>> "node_name":"10.156.112.50:10006_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}}}, > >>> "shard2":{ > >>> "range":"a0000000-bfffffff", > >>> "state":"active", > >>> "replicas":{ > >>> "core_node9":{ > >>> "core":"TEST_COLLECTION_shard2_replica_n6", > >>> "base_url":"http://10.156.112.50:10003/solr", > >>> "node_name":"10.156.112.50:10003_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node11":{ > >>> "core":"TEST_COLLECTION_shard2_replica_n8", > >>> "base_url":"http://10.156.122.13:10004/solr", > >>> "node_name":"10.156.122.13:10004_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false", > >>> "leader":"true"}, > >>> "core_node12":{ > >>> "core":"TEST_COLLECTION_shard2_replica_n10", > >>> "base_url":"http://10.156.116.34:10008/solr", > >>> "node_name":"10.156.116.34:10008_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}}}, > >>> "shard3":{ > >>> "range":"c0000000-dfffffff", > >>> "state":"active", > >>> "replicas":{ > >>> "core_node15":{ > >>> "core":"TEST_COLLECTION_shard3_replica_n13", > >>> "base_url":"http://10.156.122.13:10008/solr", > >>> "node_name":"10.156.122.13:10008_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node17":{ > >>> "core":"TEST_COLLECTION_shard3_replica_n14", > >>> "base_url":"http://10.156.116.34:10005/solr", > >>> "node_name":"10.156.116.34:10005_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node19":{ > >>> "core":"TEST_COLLECTION_shard3_replica_n16", > >>> "base_url":"http://10.156.116.34:10002/solr", > >>> "node_name":"10.156.116.34:10002_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false", > >>> "leader":"true"}}}, > >>> "shard4":{ > >>> "range":"e0000000-ffffffff", > >>> 
"state":"active", > >>> "replicas":{ > >>> "core_node20":{ > >>> "core":"TEST_COLLECTION_shard4_replica_n18", > >>> "base_url":"http://10.156.122.13:10001/solr", > >>> "node_name":"10.156.122.13:10001_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node23":{ > >>> "core":"TEST_COLLECTION_shard4_replica_n21", > >>> "base_url":"http://10.156.116.34:10004/solr", > >>> "node_name":"10.156.116.34:10004_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node25":{ > >>> "core":"TEST_COLLECTION_shard4_replica_n22", > >>> "base_url":"http://10.156.112.50:10001/solr", > >>> "node_name":"10.156.112.50:10001_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false", > >>> "leader":"true"}}}, > >>> "shard5":{ > >>> "range":"0-1fffffff", > >>> "state":"active", > >>> "replicas":{ > >>> "core_node27":{ > >>> "core":"TEST_COLLECTION_shard5_replica_n24", > >>> "base_url":"http://10.156.116.34:10007/solr", > >>> "node_name":"10.156.116.34:10007_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node29":{ > >>> "core":"TEST_COLLECTION_shard5_replica_n26", > >>> "base_url":"http://10.156.122.13:10006/solr", > >>> "node_name":"10.156.122.13:10006_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node31":{ > >>> "core":"TEST_COLLECTION_shard5_replica_n28", > >>> "base_url":"http://10.156.116.34:10006/solr", > >>> "node_name":"10.156.116.34:10006_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false", > >>> "leader":"true"}}}, > >>> "shard6":{ > >>> "range":"20000000-3fffffff", > >>> "state":"active", > >>> "replicas":{ > >>> "core_node33":{ > >>> "core":"TEST_COLLECTION_shard6_replica_n30", > >>> "base_url":"http://10.156.122.13:10002/solr", > >>> "node_name":"10.156.122.13:10002_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false", > >>> "leader":"true"}, > >>> "core_node35":{ > >>> "core":"TEST_COLLECTION_shard6_replica_n32", > >>> "base_url":"http://10.156.112.50:10008/solr", > >>> "node_name":"10.156.112.50:10008_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node37":{ > >>> "core":"TEST_COLLECTION_shard6_replica_n34", > >>> "base_url":"http://10.156.116.34:10003/solr", > >>> "node_name":"10.156.116.34:10003_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}}}, > >>> "shard7":{ > >>> "range":"40000000-5fffffff", > >>> "state":"active", > >>> "replicas":{ > >>> "core_node39":{ > >>> "core":"TEST_COLLECTION_shard7_replica_n36", > >>> "base_url":"http://10.156.122.13:10003/solr", > >>> "node_name":"10.156.122.13:10003_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false", > >>> "leader":"true"}, > >>> "core_node41":{ > >>> "core":"TEST_COLLECTION_shard7_replica_n38", > >>> "base_url":"http://10.156.122.13:10005/solr", > >>> "node_name":"10.156.122.13:10005_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node43":{ > >>> "core":"TEST_COLLECTION_shard7_replica_n40", > >>> "base_url":"http://10.156.112.50:10004/solr", > >>> "node_name":"10.156.112.50:10004_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}}}, > >>> "shard8":{ > >>> "range":"60000000-7fffffff", > >>> "state":"active", > >>> "replicas":{ > >>> "core_node45":{ > >>> 
"core":"TEST_COLLECTION_shard8_replica_n42", > >>> "base_url":"http://10.156.112.50:10007/solr", > >>> "node_name":"10.156.112.50:10007_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}, > >>> "core_node47":{ > >>> "core":"TEST_COLLECTION_shard8_replica_n44", > >>> "base_url":"http://10.156.112.50:10005/solr", > >>> "node_name":"10.156.112.50:10005_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false", > >>> "leader":"true"}, > >>> "core_node48":{ > >>> "core":"TEST_COLLECTION_shard8_replica_n46", > >>> "base_url":"http://10.156.116.34:10001/solr", > >>> "node_name":"10.156.116.34:10001_solr", > >>> "state":"active", > >>> "type":"NRT", > >>> "force_set_state":"false"}}}}, > >>> "router":{"name":"compositeId"}, > >>> "maxShardsPerNode":"1", > >>> "autoAddReplicas":"false", > >>> "nrtReplicas":"3", > >>> "tlogReplicas":"0”}} > >>> > >>> > >>> Running TOP on each machine while load tests have been running for 60 > >> minutes. > >>> > >>> 10.156.112.50 load average: 0.08, 0.35, 1.65 > >>> 10.156.116.34 load average: 24.71, 24.20, 20.65 > >>> 10.156.122.13 load average: 5.37, 3.21, 4.04 > >>> > >>> > >>> > >>> Here are the stats from each shard leader. > >>> > >>> > >> > http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >>> { > >>> "responseHeader":{ > >>> "status":0, > >>> "QTime":2}, > >>> "metrics":{ > >>> "solr.core.BTS.shard1.replica_n2":{ > >>> "QUERY./select.requestTimes":{ > >>> "count":805, > >>> "meanRate":0.4385455794526838, > >>> "1minRate":0.5110237122383522, > >>> "5minRate":0.4671091682458005, > >>> "15minRate":0.4057871940723353, > >>> "min_ms":0.14047, > >>> "max_ms":12424.589645, > >>> "mean_ms":796.2194458711818, > >>> "median_ms":10.534906, > >>> "stddev_ms":2567.655224710497, > >>> "p75_ms":22.893306, > >>> "p95_ms":8316.33323, > >>> "p99_ms":12424.589645, > >>> "p999_ms":12424.589645}}}} > >>> > >>> > >> > http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >>> { > >>> "responseHeader":{ > >>> "status":0, > >>> "QTime":2}, > >>> "metrics":{ > >>> "solr.core.BTS.shard2.replica_n8":{ > >>> "QUERY./select.requestTimes":{ > >>> "count":791, > >>> "meanRate":0.4244162938316224, > >>> "1minRate":0.4869749626003825, > >>> "5minRate":0.45856412657687656, > >>> "15minRate":0.3948063845907493, > >>> "min_ms":0.168369, > >>> "max_ms":11022.763933, > >>> "mean_ms":2572.0670957974603, > >>> "median_ms":1490.222885, > >>> "stddev_ms":2718.1710938804276, > >>> "p75_ms":4292.490478, > >>> "p95_ms":8487.18506, > >>> "p99_ms":8855.936617, > >>> "p999_ms":9589.218502}}}} > >>> > >>> > >> > http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >>> { > >>> "responseHeader":{ > >>> "status":0, > >>> "QTime":83}, > >>> "metrics":{ > >>> "solr.core.BTS.shard3.replica_n16":{ > >>> "QUERY./select.requestTimes":{ > >>> "count":840, > >>> "meanRate":0.4335334453288775, > >>> "1minRate":0.5733683837779382, > >>> "5minRate":0.4931753679028527, > >>> "15minRate":0.42241330274699623, > >>> "min_ms":0.155939, > >>> "max_ms":18125.516406, > >>> "mean_ms":7097.942850416767, > >>> 
"median_ms":8136.862825, > >>> "stddev_ms":2382.041897221542, > >>> "p75_ms":8497.844088, > >>> "p95_ms":9642.430475, > >>> "p99_ms":9993.694346, > >>> "p999_ms":12207.982291}}}} > >>> > >>> > >> > http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >>> { > >>> "responseHeader":{ > >>> "status":0, > >>> "QTime":3}, > >>> "metrics":{ > >>> "solr.core.BTS.shard4.replica_n22":{ > >>> "QUERY./select.requestTimes":{ > >>> "count":873, > >>> "meanRate":0.43420303985137254, > >>> "1minRate":0.4284437786865815, > >>> "5minRate":0.44020640429418745, > >>> "15minRate":0.40860871277629196, > >>> "min_ms":0.136658, > >>> "max_ms":11345.407699, > >>> "mean_ms":511.28573906464504, > >>> "median_ms":9.063677, > >>> "stddev_ms":2038.8104673512248, > >>> "p75_ms":20.270605, > >>> "p95_ms":8418.131442, > >>> "p99_ms":8904.78616, > >>> "p999_ms":10447.78365}}}} > >>> > >>> > >> > http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >>> { > >>> "responseHeader":{ > >>> "status":0, > >>> "QTime":4}, > >>> "metrics":{ > >>> "solr.core.BTS.shard5.replica_n28":{ > >>> "QUERY./select.requestTimes":{ > >>> "count":863, > >>> "meanRate":0.4419375762840668, > >>> "1minRate":0.44487242228317025, > >>> "5minRate":0.45927613542085916, > >>> "15minRate":0.41056066296443494, > >>> "min_ms":0.158855, > >>> "max_ms":16669.411989, > >>> "mean_ms":6513.057114006753, > >>> "median_ms":8033.386692, > >>> "stddev_ms":3002.7487311308896, > >>> "p75_ms":8446.147616, > >>> "p95_ms":9888.641316, > >>> "p99_ms":13624.11926, > >>> "p999_ms":13624.11926}}}} > >>> > >>> > >> > http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >>> { > >>> "responseHeader":{ > >>> "status":0, > >>> "QTime":2}, > >>> "metrics":{ > >>> "solr.core.BTS.shard6.replica_n30":{ > >>> "QUERY./select.requestTimes":{ > >>> "count":893, > >>> "meanRate":0.43301141185981046, > >>> "1minRate":0.4011485529441132, > >>> "5minRate":0.447654905093643, > >>> "15minRate":0.41489193746842407, > >>> "min_ms":0.161571, > >>> "max_ms":14716.828978, > >>> "mean_ms":2932.212133523417, > >>> "median_ms":1289.686481, > >>> "stddev_ms":3426.22045100954, > >>> "p75_ms":6230.031884, > >>> "p95_ms":8109.408506, > >>> "p99_ms":12904.515311, > >>> "p999_ms":12904.515311}}}} > >>> > >>> > >>> > >>> > >> > http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >>> { > >>> "responseHeader":{ > >>> "status":0, > >>> "QTime":16}, > >>> "metrics":{ > >>> "solr.core.BTS.shard7.replica_n36":{ > >>> "QUERY./select.requestTimes":{ > >>> "count":962, > >>> "meanRate":0.46572438680661055, > >>> "1minRate":0.4974893681625287, > >>> "5minRate":0.49072296556429784, > >>> "15minRate":0.44138205926188756, > >>> "min_ms":0.164803, > >>> "max_ms":12481.82656, > >>> "mean_ms":2606.899631183513, > >>> "median_ms":1457.505387, > >>> "stddev_ms":3083.297183477969, > >>> "p75_ms":4072.543679, > >>> "p95_ms":8562.456178, > >>> "p99_ms":9351.230895, > >>> "p999_ms":10430.483813}}}} > >>> > >>> > >> > 
http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >> < > >> > http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes > >>> > >>> { > >>> "responseHeader":{ > >>> "status":0, > >>> "QTime":3}, > >>> "metrics":{ > >>> "solr.core.BTS.shard8.replica_n44":{ > >>> "QUERY./select.requestTimes":{ > >>> "count":904, > >>> "meanRate":0.4356001115451976, > >>> "1minRate":0.42906831311171356, > >>> "5minRate":0.4651312663377039, > >>> "15minRate":0.41812847342709225, > >>> "min_ms":0.089738, > >>> "max_ms":10857.092832, > >>> "mean_ms":304.52127270799156, > >>> "median_ms":7.098736, > >>> "stddev_ms":1544.5378594679773, > >>> "p75_ms":15.599817, > >>> "p95_ms":93.818662, > >>> "p99_ms":8510.757117, > >>> "p999_ms":9353.844994}}}} > >>> > >>> I restart all of the instances on “34” so that there are no leaders on > >> it. The load somewhat goes to the other box. > >>> > >>> 10.156.112.50 load average: 0.00, 0.16, 0.47 > >>> 10.156.116.34 load average: 17.00, 16.16, 17.07 > >>> 10.156.122.13 load average: 17.86, 17.49, 14.74 > >>> > >>> Box “50” is still doing nothing AND it is the leader of 4 of the 8 > >> shards. > >>> Box “13” is the leader of the remaining 4 shards. > >>> Box “34” is not the leader of any shard. > >>> > >>> I will continue to test, who knows, it may be something I am doing. > >> Maybe not enough RAM, etc…, so I am definitely leaving this open to the > >> possibility that I am not well configured for 8.5. > >>> > >>> Regards > >>> > >>> > >>> > >>> > >>>> On May 16, 2020, at 5:08 PM, Tomás Fernández Löbbe < > >> tomasflo...@gmail.com> wrote: > >>>> > >>>> I just backported Michael’s fix to be released in 8.5.2 > >>>> > >>>> On Fri, May 15, 2020 at 6:38 AM Michael Gibney < > >> mich...@michaelgibney.net> > >>>> wrote: > >>>> > >>>>> Hi Wei, > >>>>> SOLR-14471 has been merged, so this issue should be fixed in 8.6. > >>>>> Thanks for reporting the problem! > >>>>> Michael > >>>>> > >>>>> On Mon, May 11, 2020 at 7:51 PM Wei <weiwan...@gmail.com> wrote: > >>>>>> > >>>>>> Thanks Michael! Yes in each shard I have 10 Tlog replicas, no > other > >>>>> type > >>>>>> of replicas, and each Tlog replica is an individual solr instance on > >> its > >>>>>> own physical machine. In the jira you mentioned 'when "last place > >>>>> matches" > >>>>>> == "first place matches" – e.g. when shards.preference specified > >> matches > >>>>>> *all* available replicas'. My setting is > >>>>>> shards.preference=replica.location:local,replica.type:TLOG, > >>>>>> I also tried just shards.preference=replica.location:local and it > >> still > >>>>> has > >>>>>> the issue. Can you explain a bit more? > >>>>>> > >>>>>> On Mon, May 11, 2020 at 12:26 PM Michael Gibney < > >>>>> mich...@michaelgibney.net> > >>>>>> wrote: > >>>>>> > >>>>>>> FYI: https://issues.apache.org/jira/browse/SOLR-14471 > >>>>>>> Wei, assuming you have only TLOG replicas, your "last place" > matches > >>>>>>> (to which the random fallback ordering would not be applied -- see > >>>>>>> above issue) would be the same as the "first place" matches > selected > >>>>>>> for executing distributed requests. > >>>>>>> > >>>>>>> > >>>>>>> On Mon, May 11, 2020 at 1:49 PM Michael Gibney > >>>>>>> <mich...@michaelgibney.net> wrote: > >>>>>>>> > >>>>>>>> Wei, probably no need to answer my earlier questions; I think I > see > >>>>>>>> the problem here, and believe it is indeed a bug, introduced in > 8.3. > >>>>>>>> Will file an issue and submit a patch shortly. 
> >>>>>>>> Michael > >>>>>>>> > >>>>>>>> On Mon, May 11, 2020 at 12:49 PM Michael Gibney > >>>>>>>> <mich...@michaelgibney.net> wrote: > >>>>>>>>> > >>>>>>>>> Hi Wei, > >>>>>>>>> > >>>>>>>>> In considering this problem, I'm stumbling a bit on terminology > >>>>>>>>> (particularly, where you mention "nodes", I think you're > referring > >>>>> to > >>>>>>>>> "replicas"?). Could you confirm that you have 10 TLOG replicas > per > >>>>>>>>> shard, for each of 6 shards? How many *nodes* (i.e., running solr > >>>>>>>>> server instances) do you have, and what is the replica placement > >>>>> like > >>>>>>>>> across those nodes? What, if any, non-TLOG replicas do you have > per > >>>>>>>>> shard (not that it's necessarily relevant, but just to get a > >>>>> complete > >>>>>>>>> picture of the situation)? > >>>>>>>>> > >>>>>>>>> If you're able without too much trouble, can you determine what > the > >>>>>>>>> behavior is like on Solr 8.3? (there were different changes > >>>>> introduced > >>>>>>>>> to potentially relevant code in 8.3 and 8.4, and knowing whether > >>>>> the > >>>>>>>>> behavior you're observing manifests on 8.3 would help narrow down > >>>>>>>>> where to look for an explanation). > >>>>>>>>> > >>>>>>>>> Michael > >>>>>>>>> > >>>>>>>>> On Fri, May 8, 2020 at 7:34 PM Wei <weiwan...@gmail.com> wrote: > >>>>>>>>>> > >>>>>>>>>> Update: after I remove the shards.preference parameter from > >>>>>>>>>> solrconfig.xml, issue is gone and internal shard requests are > >>>>> now > >>>>>>>>>> balanced. The same parameter works fine with solr 7.6. Still > not > >>>>>>> sure of > >>>>>>>>>> the root cause, but I observed a strange coincidence: the nodes > >>>>> that > >>>>>>> are > >>>>>>>>>> most frequently picked for shard requests are the first node in > >>>>> each > >>>>>>> shard > >>>>>>>>>> returned from the CLUSTERSTATUS api. Seems something wrong with > >>>>>>> shuffling > >>>>>>>>>> equally compared nodes when shards.preference is set. Will > >>>>> report > >>>>>>> back if > >>>>>>>>>> I find more. > >>>>>>>>>> > >>>>>>>>>> On Mon, Apr 27, 2020 at 5:59 PM Wei <weiwan...@gmail.com> > wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Eric, > >>>>>>>>>>> > >>>>>>>>>>> I am measuring the number of shard requests, and it's for query > >>>>>>> only, no > >>>>>>>>>>> indexing requests. I have an external load balancer and see > >>>>> each > >>>>>>> node > >>>>>>>>>>> received about the equal number of external queries. However > >>>>> for > >>>>>>> the > >>>>>>>>>>> internal shard queries, the distribution is uneven: 6 nodes > >>>>>>> (one in > >>>>>>>>>>> each shard, some of them are leaders and some are non-leaders > >>>>> ) > >>>>>>> gets about > >>>>>>>>>>> 80% of the shard requests, the other 54 nodes gets about 20% of > >>>>>>> the shard > >>>>>>>>>>> requests. I checked a few other parameters set: > >>>>>>>>>>> > >>>>>>>>>>> -Dsolr.disable.shardsWhitelist=true > >>>>>>>>>>> shards.preference=replica.location:local,replica.type:TLOG > >>>>>>>>>>> > >>>>>>>>>>> Nothing seems to cause the strange behavior. Any suggestions > >>>>> how > >>>>>>> to > >>>>>>>>>>> debug this? > >>>>>>>>>>> > >>>>>>>>>>> -Wei > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson < > >>>>>>> erickerick...@gmail.com> > >>>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Wei: > >>>>>>>>>>>> > >>>>>>>>>>>> How are you measuring utilization here? The number of incoming > >>>>>>> requests > >>>>>>>>>>>> or CPU? 
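For reference, the measurement can also be scripted against the same metrics endpoint quoted elsewhere in this thread. Below is a rough Java sketch, assuming one core per Solr instance (as in the setups described here); the node URLs are placeholders copied from the metrics links in the thread, and the counter includes both top-level and internal /select requests.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShardRequestCounts {
    // Placeholder node addresses -- substitute the real host:port list.
    private static final List<String> NODES = List.of(
        "http://10.156.112.50:10002/solr",
        "http://10.156.122.13:10004/solr",
        "http://10.156.116.34:10002/solr");

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        Pattern count = Pattern.compile("\"count\":(\\d+)");
        for (String node : NODES) {
            String url = node + "/admin/metrics?group=core&prefix=QUERY./select.requestTimes&wt=json";
            HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
            String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
            Matcher m = count.matcher(body);
            // With one core per instance, the first "count" is the cumulative number
            // of /select requests (external plus distributed) that core has served.
            System.out.println(node + " -> " + (m.find() ? m.group(1) : "no count found"));
        }
    }
}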
> >>>>>>>>>>>> > >>>>>>>>>>>> The leader for each shard are certainly handling all of the > >>>>>>> indexing > >>>>>>>>>>>> requests since they’re TLOG replicas, so that’s one thing that > >>>>>>> might > >>>>>>>>>>>> skewing your measurements. > >>>>>>>>>>>> > >>>>>>>>>>>> Best, > >>>>>>>>>>>> Erick > >>>>>>>>>>>> > >>>>>>>>>>>>> On Apr 27, 2020, at 7:13 PM, Wei <weiwan...@gmail.com> > >>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> Hi everyone, > >>>>>>>>>>>>> > >>>>>>>>>>>>> I have a strange issue after upgrade from 7.6.0 to 8.4.1. My > >>>>>>> cloud has 6 > >>>>>>>>>>>>> shards with 10 TLOG replicas each shard. After upgrade I > >>>>>>> noticed that > >>>>>>>>>>>> one > >>>>>>>>>>>>> of the replicas in each shard is handling most of the > >>>>>>> distributed shard > >>>>>>>>>>>>> requests, so 6 nodes are heavily loaded while other nodes > >>>>> are > >>>>>>> idle. > >>>>>>>>>>>> There > >>>>>>>>>>>>> is no change in shard handler configuration: > >>>>>>>>>>>>> > >>>>>>>>>>>>> <shardHandlerFactory name="shardHandlerFactory" class= > >>>>>>>>>>>>> "HttpShardHandlerFactory"> > >>>>>>>>>>>>> > >>>>>>>>>>>>> <int name="socketTimeout">30000</int> > >>>>>>>>>>>>> > >>>>>>>>>>>>> <int name="connTimeout">30000</int> > >>>>>>>>>>>>> > >>>>>>>>>>>>> <int name="maxConnectionsPerHost">500</int> > >>>>>>>>>>>>> > >>>>>>>>>>>>> </shardHandlerFactory> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> What could cause the unbalanced internal distributed > >>>>> request? > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks in advance. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Wei > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>> > >>>>> > >>> > >> > >> > >
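One more note on the cluster-state angle: since the thread keeps coming back to which replica is listed first in each shard and where the leaders sit, a small sketch like the one below (against the 8.x SolrJ client) prints each shard's replicas in the order the cluster state reports them, flagging leaders and replica types. The listing order should mirror what CLUSTERSTATUS returns, so it is a quick way to check whether the heavily loaded replicas are simply the first-listed ones. ZooKeeper address and collection name are placeholders.

import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ReplicaOrderReport {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder(List.of("zk1:2181"), Optional.empty()).build()) {
            client.connect();
            DocCollection coll =
                client.getZkStateReader().getClusterState().getCollection("my_collection");
            for (Slice shard : coll.getSlices()) {
                System.out.println(shard.getName() + ":");
                Replica leader = shard.getLeader();
                for (Replica r : shard.getReplicas()) {
                    // Flag the leader; with TLOG replicas the leader also does all indexing.
                    String mark = (leader != null && leader.getName().equals(r.getName()))
                        ? " (leader)" : "";
                    System.out.println("  " + r.getNodeName() + " " + r.getType() + mark);
                }
            }
        }
    }
}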