Metric Trigger not being recognised & picked up

2020-10-22 Thread Jonathan Tan
Hi All

I've been trying to get a metric trigger set up in SolrCloud 8.4.1, but
it's not working, and was hoping for some help.

I've created a metric trigger using this:

```
POST /solr/admin/autoscaling
{
  "set-trigger": {
    "name": "metric_trigger",
    "event": "metric",
    "waitFor": "10s",
    "metric": "metrics:solr.jvm:os.systemCpuLoad",
    "above": 0.7,
    "preferredOperation": "MOVEREPLICA",
    "enabled": true
  }
}
```

And I get a successful response.

I can also see the new trigger in the `files -> tree -> autoscaling.json`.
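In case it's useful, this is roughly how I've been double-checking that the trigger actually made it into the cluster config (assuming a local node on port 8983):

```
# Read back the autoscaling config; metric_trigger should appear under "triggers"
curl "http://localhost:8983/solr/admin/autoscaling"
```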

However, I don't see any difference in the logs (I had the autoscaling
logging set to debug), it's definitely not moving any replicas around
when under load, and the node is consistently above 85% overall
systemCpuLoad. (I can see this as well when I use the `/metrics` endpoint
with the above key.)
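(For reference, this is the sort of check I'm doing against the metrics API on each node; the `key` filter syntax is just my understanding of it:)

```
# Spot-check the same metric the trigger watches
curl "http://localhost:8983/solr/admin/metrics?key=solr.jvm:os.systemCpuLoad"
```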


I then restarted all the nodes, and saw this error on startup saying it
couldn't set the trigger state during a restore, with the worrying part
being that it is discarding the trigger event...

I'd really like some help with this.

We've been seeing that out of the 3 nodes, one is always - seemingly
randomly - massively utilised on CPU (all 8 cores maxed out, and it's not
always the node running the overseer), so we were hoping that we could let
the metric trigger sort it out in the short term.
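In the meantime we've been considering just moving replicas by hand with the Collections API as a stop-gap - something along these lines, where the target node name is only a placeholder for one of our other pods:

```
# Manually move one replica of shard5 off the hot node (node names are placeholders)
curl "http://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=mycoll-2&shard=shard5&sourceNode=mycoll-solr-solr-service-1.mycoll-solr-solr-service-headless.mycoll-solr-test:8983_solr&targetNode=mycoll-solr-solr-service-2.mycoll-solr-solr-service-headless.mycoll-solr-test:8983_solr"
```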

```
2020-10-22 23:03:19.905 ERROR (ScheduledTrigger-7-thread-3) [   ] o.a.s.c.a.ScheduledTriggers Error restoring trigger state jvm_cpu_trigger => java.lang.NullPointerException
  at org.apache.solr.cloud.autoscaling.MetricTrigger.setState(MetricTrigger.java:94)
java.lang.NullPointerException: null
  at org.apache.solr.cloud.autoscaling.MetricTrigger.setState(MetricTrigger.java:94) ~[?:?]
  at org.apache.solr.cloud.autoscaling.TriggerBase.restoreState(TriggerBase.java:279) ~[?:?]
  at org.apache.solr.cloud.autoscaling.ScheduledTriggers$TriggerWrapper.run(ScheduledTriggers.java:638) ~[?:?]
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
  at java.lang.Thread.run(Thread.java:834) [?:?]
2020-10-22 23:03:19.912 ERROR (ScheduledTrigger-7-thread-1) [   ] o.a.s.c.a.ScheduledTriggers Failed to re-play event, discarding: {
  "id":"dd2ebf3d56bTboddkoovyjxdvy1hauq2zskpt",
  "source":"metric_trigger",
  "eventTime":15199552918891,
  "eventType":"METRIC",
  "properties":{
    "node":{"mycoll-solr-solr-service-1.mycoll-solr-solr-service-headless.mycoll-solr-test:8983_solr":0.7322834645669292},
    "_dequeue_time_":261690991035,
    "metric":"metrics:solr.jvm:os.systemCpuLoad",
    "preferredOperation":"MOVEREPLICA",
    "_enqueue_time_":15479182216601,
    "requestedOps":[{
      "action":"MOVEREPLICA",
      "hints":{"SRC_NODE":["mycoll-solr-solr-service-1.mycoll-solr-solr-service-headless.mycoll-solr-test:8983_solr"]}}],
    "replaying":true}}
2020-10-22 23:03:19.913 INFO  (OverseerStateUpdate-144115201265369088-mycoll-solr-solr-service-0.mycoll-solr-solr-service-headless.mycoll-solr-test:8983_solr-n_000199) [   ] o.a.s.c.o.SliceMutator createReplica() {
  "operation":"addreplica",
  "collection":"mycoll-2",
  "shard":"shard5",
  "core":"mycoll-2_shard5_replica_n122",
  "state":"down",
  "base_url":"http://mycoll-solr-solr-service-0.mycoll-solr-solr-service-headless.mycoll-solr-test:8983/solr",
  "node_name":"mycoll-solr-solr-service-0.mycoll-solr-solr-service-headless.mycoll-solr-test:8983_solr",
  "type":"NRT"}
2020-10-22 23:03:19.921 ERROR (ScheduledTrigger-7-thread-1) [   ] o.a.s.c.a.ScheduledTriggers Error restoring trigger state metric_trigger => java.lang.NullPointerException
  at org.apache.solr.cloud.autoscaling.MetricTrigger.setState(MetricTrigger.java:94)
java.lang.NullPointerException: null
  at org.apache.solr.cloud.autoscaling.MetricTrigger.setState(MetricTrigger.java:94) ~[?:?]
  at org.apache.solr.cloud.autoscaling.TriggerBase.restoreState(TriggerBase.java:279) ~[?:?]
  at org.apache.solr.cloud.autoscaling.ScheduledTriggers$TriggerWrapper.run(ScheduledTriggers.java:638) ~[?:?]
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
  at java.lang.Thread.run(Thread.java:834) [?:?]

```


Any help please?
Thank you
Jonathan


Massively unbalanced CPU by different SOLR Nodes

2020-10-22 Thread Jonathan Tan
Hi,

We've got a 3 node SolrCloud cluster running on GKE, each on their own kube
node (which is in itself, relatively empty of other things).

Our collection has ~18m documents, ~36GB in size, split into 6 shards with
2 replicas each, and they are evenly distributed across the 3 nodes. Our
JVMs are currently sized to ~14GB min & max, and they are running on SSDs.
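(For context, the collection was created with something roughly equivalent to the call below - the name and the maxShardsPerNode value are just illustrative of the layout described above:)

```
# 6 shards x 2 replicas = 12 cores, spread evenly as 4 per node
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll-2&numShards=6&replicationFactor=2&maxShardsPerNode=4"
```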


[image: Screen Shot 2020-10-23 at 2.15.48 pm.png]

Graph also available here: https://pasteboard.co/JwUQ98M.png

Under perf testing of ~30 requests per second, we start seeing really bad
response times (around 3s in the 90th percentile), and *one* of the nodes
would be fully maxed out on CPU. At about 15 requests per second, our
response times are reasonable enough for our purposes (~0.8-1.1s), but as
is visible in the graph, it's definitely *not* an even distribution of the
CPU load. One of the nodes is running at around 13 cores, whilst the other 2
are running at ~8 cores and ~6 cores respectively.

We've tracked in our monitoring tools that the 3 nodes *are* getting an
even distribution of requests, and we're using a Kube service which is in
itself a fairly well known tool for load balancing pods. We've also used
kube services heaps for load balancing of other apps and haven't seen such
a problem, so we doubt it's the load balancer that is the problem.

All 3 nodes are built from the same kubernetes statefulset deployment so
they'd all have the same configuration & setup. Additionally, over the
course of the day, it may suddenly change so that an entirely different
node is the one that is majorly overloaded on CPU.

All this is happening only under queries, and we are doing no indexing at
that time.

We'd initially thought it might be the overseer that is being majorly
overloaded when under queries (although we were surprised) until we did
more testing and found that even the nodes that weren't overseer would
sometimes have that disparity. We'd also tried using the `ADDROLE` API to
force an overseer change in the middle of a test, and whilst the tree
updated to show that the overseer had changed, it made no difference to the
highest CPU load.
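(The ADDROLE call we used was essentially this, with the node name being whichever pod we wanted to take over the overseer role - it's a placeholder here:)

```
# Designate a preferred overseer on a specific node (node name is a placeholder)
curl "http://localhost:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=mycoll-solr-solr-service-2.mycoll-solr-solr-service-headless.mycoll-solr-test:8983_solr"
```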

Directing queries directly to the non-busy nodes does actually give us back
decent response times.
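(By "directly" I just mean bypassing the kube service and hitting one specific pod, e.g. something like the following - the query itself is only an example:)

```
curl "http://mycoll-solr-solr-service-2.mycoll-solr-solr-service-headless.mycoll-solr-test:8983/solr/mycoll-2/select?q=*:*&rows=10"
```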

We're quite puzzled by this and would really like some help figuring out
*why* the CPU on one is so much higher. I did try to get the jaeger tracing
working (we already have jaeger in our cluster), but we just kept getting
errors on startup with solr not being able to load the main function...


Thank you in advance!
Cheers
Jonathan


Re: Massively unbalanced CPU by different SOLR Nodes

2020-10-24 Thread Jonathan Tan
Hi Shalin,

Yes we are as a matter of fact! We're preferring local replicas, but given
the description of the bug, is it possible that that's forcing some other
behaviour where - given equal shards - it will always route to the same
shard?
Not 100% sure if I understand it. That said, thank you, we'll try with Solr
8.6 and I'll report back.
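For reference, this is roughly what we're sending today, and how I'm planning to compare the /select metrics across nodes per your suggestion (the prefix filter is my guess at the relevant metric names):

```
# Query-time preference we currently set (as a client-side default)
#   shards.preference=replica.location:local

# Per-node check of the /select handler metrics, run against each node in turn
curl "http://mycoll-solr-solr-service-0.mycoll-solr-solr-service-headless.mycoll-solr-test:8983/solr/admin/metrics?group=core&prefix=QUERY./select"
```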

Cheers
Jonathan

On Sat, Oct 24, 2020 at 11:37 PM Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Hi Jonathan,
>
> Are you using the "shards.preference" parameter by any chance? There is a
> bug that causes uneven request distribution during fan-out. Can you check
> the number of requests using the /admin/metrics API? Look for the /select
> handler's distrib and local request times for each core in the node.
> Compare those across different nodes.
>
> The bug I refer to is https://issues.apache.org/jira/browse/SOLR-14471 and
> it is fixed in Solr 8.5.2
>
> On Fri, Oct 23, 2020 at 9:05 AM Jonathan Tan  wrote:
>
> > Hi,
> >
> > We've got a 3 node SolrCloud cluster running on GKE, each on their own
> > kube node (which is in itself, relatively empty of other things).
> >
> > Our collection has ~18m documents of 36gb in size, split into 6 shards
> > with 2 replicas each, and they are evenly distributed across the 3 nodes.
> > Our JVMs are currently sized to ~14gb min & max , and they are running on
> > SSDs.
> >
> >
> > [image: Screen Shot 2020-10-23 at 2.15.48 pm.png]
> >
> > Graph also available here: https://pasteboard.co/JwUQ98M.png
> >
> > Under perf testing of ~30 requests per second, we start seeing really bad
> > response times (around 3s in the 90th percentile, and *one* of the nodes
> > would be fully maxed out on CPU. At about 15 requests per second, our
> > response times are reasonable enough for our purposes (~0.8-1.1s), but as
> > is visible in the graph, it's definitely *not* an even distribution of
> the
> > CPU load. One of the nodes is running at around 13cores, whilst the
> other 2
> > are running at ~8cores and 6 cores respectively.
> >
> > We've tracked in our monitoring tools that the 3 nodes *are* getting an
> > even distribution of requests, and we're using a Kube service which is in
> > itself a fairly well known tool for load balancing pods. We've also used
> > kube services heaps for load balancing of other apps and haven't seen
> such
> > a problem, so we doubt it's the load balancer that is the problem.
> >
> > All 3 nodes are built from the same kubernetes statefulset deployment so
> > they'd all have the same configuration & setup. Additionally, over the
> > course of the day, it may suddenly change so that an entirely different
> > node is the one that is majorly overloaded on CPU.
> >
> > All this is happening only under queries, and we are doing no indexing at
> > that time.
> >
> > We'd initially thought it might be the overseer that is being majorly
> > overloaded when under queries (although we were surprised) until we did
> > more testing and found that even the nodes that weren't overseer would
> > sometimes have that disparity. We'd also tried using the `ADDROLE` API to
> > force an overseer change in the middle of a test, and whilst the tree
> > updated to show that the overseer had changed, it made no difference to
> the
> > highest CPU load.
> >
> > Directing queries directly to the non-busy nodes do actually give us back
> > decent response times.
> >
> > We're quite puzzled by this and would really like some help figuring out
> > *why* the CPU on one is so much higher. I did try to get the jaeger
> tracing
> > working (we already have jaeger in our cluster), but we just kept getting
> > errors on startup with solr not being able to load the main function...
> >
> >
> > Thank you in advance!
> > Cheers
> > Jonathan
> >
> >
> >
> >
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Massively unbalanced CPU by different SOLR Nodes

2020-10-26 Thread Jonathan Tan
Hi Shalin,

Moving to 8.6.3 fixed it!

Thank you very much for that. :)
We'd considered an upgrade - just because - but we wouldn't have done so
this quickly without your information.

Cheers

On Sat, Oct 24, 2020 at 11:37 PM Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Hi Jonathan,
>
> Are you using the "shards.preference" parameter by any chance? There is a
> bug that causes uneven request distribution during fan-out. Can you check
> the number of requests using the /admin/metrics API? Look for the /select
> handler's distrib and local request times for each core in the node.
> Compare those across different nodes.
>
> The bug I refer to is https://issues.apache.org/jira/browse/SOLR-14471 and
> it is fixed in Solr 8.5.2
>
> On Fri, Oct 23, 2020 at 9:05 AM Jonathan Tan  wrote:
>
> > Hi,
> >
> > We've got a 3 node SolrCloud cluster running on GKE, each on their own
> > kube node (which is in itself, relatively empty of other things).
> >
> > Our collection has ~18m documents of 36gb in size, split into 6 shards
> > with 2 replicas each, and they are evenly distributed across the 3 nodes.
> > Our JVMs are currently sized to ~14gb min & max , and they are running on
> > SSDs.
> >
> >
> > [image: Screen Shot 2020-10-23 at 2.15.48 pm.png]
> >
> > Graph also available here: https://pasteboard.co/JwUQ98M.png
> >
> > Under perf testing of ~30 requests per second, we start seeing really bad
> > response times (around 3s in the 90th percentile, and *one* of the nodes
> > would be fully maxed out on CPU. At about 15 requests per second, our
> > response times are reasonable enough for our purposes (~0.8-1.1s), but as
> > is visible in the graph, it's definitely *not* an even distribution of
> the
> > CPU load. One of the nodes is running at around 13cores, whilst the
> other 2
> > are running at ~8cores and 6 cores respectively.
> >
> > We've tracked in our monitoring tools that the 3 nodes *are* getting an
> > even distribution of requests, and we're using a Kube service which is in
> > itself a fairly well known tool for load balancing pods. We've also used
> > kube services heaps for load balancing of other apps and haven't seen
> such
> > a problem, so we doubt it's the load balancer that is the problem.
> >
> > All 3 nodes are built from the same kubernetes statefulset deployment so
> > they'd all have the same configuration & setup. Additionally, over the
> > course of the day, it may suddenly change so that an entirely different
> > node is the one that is majorly overloaded on CPU.
> >
> > All this is happening only under queries, and we are doing no indexing at
> > that time.
> >
> > We'd initially thought it might be the overseer that is being majorly
> > overloaded when under queries (although we were surprised) until we did
> > more testing and found that even the nodes that weren't overseer would
> > sometimes have that disparity. We'd also tried using the `ADDROLE` API to
> > force an overseer change in the middle of a test, and whilst the tree
> > updated to show that the overseer had changed, it made no difference to
> the
> > highest CPU load.
> >
> > Directing queries directly to the non-busy nodes do actually give us back
> > decent response times.
> >
> > We're quite puzzled by this and would really like some help figuring out
> > *why* the CPU on one is so much higher. I did try to get the jaeger
> tracing
> > working (we already have jaeger in our cluster), but we just kept getting
> > errors on startup with solr not being able to load the main function...
> >
> >
> > Thank you in advance!
> > Cheers
> > Jonathan
> >
> >
> >
> >
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: solrcloud with EKS kubernetes

2020-12-16 Thread Jonathan Tan
Hi Abhishek,

We're running Solr Cloud 8.6 on GKE.
3 node cluster, each running 4 CPUs (configured) and 8GB of min & max JVM
heap, all with anti-affinity so they never land on the same kube node.
It's got 2 collections of ~13m documents each, 6 shards with 3 replicas each,
and disk usage on each node is ~54GB (we've got all the shards replicated to
all nodes).

We're also using a 200gb zonal SSD, which *has* been necessary just so that
we've got the right IOPS & bandwidth. (That's approximately 6000 IOPS for
read & write each, and 96MB/s for read & write each)

Various lessons learnt...
You definitely don't want them ever on the same kubernetes node. From a
resilience perspective, yes, but also when one SOLR node gets busy, they
tend to all get busy, so now you'll have resource contention. Recovery can
also get very busy and resource intensive, and again, sitting on the same
node is problematic. We also saw the need to move to SSDs because of how
IOPS bound we were.

Did I mention use SSDs? ;)

Good luck!

On Mon, Dec 14, 2020 at 5:34 PM Abhishek Mishra 
wrote:

> Hi Houston,
> Sorry for the late reply. Each shard has a 9GB size around.
> Yeah, we are providing enough resources to pods. We are currently
> using c5.4xlarge.
> XMS and XMX is 16GB. The machine is having 32 GB and 16 core.
> No, I haven't run it outside Kubernetes. But I do have colleagues who did
> the same on 7.2 and didn't face any issue regarding it.
> Storage volume is gp2 50GB.
> It's not the search query where we are facing inconsistencies or timeouts.
> Seems some internal admin APIs sometimes have issues. So while adding new
> replica in clusters sometimes result in inconsistencies. Like recovery
> takes some time more than one hour.
>
> Regards,
> Abhishek
>
> On Thu, Dec 10, 2020 at 10:23 AM Houston Putman 
> wrote:
>
> > Hello Abhishek,
> >
> > It's really hard to provide any advice without knowing any information
> > about your setup/usage.
> >
> > Are you giving your Solr pods enough resources on EKS?
> > Have you run Solr in the same configuration outside of kubernetes in the
> > past without timeouts?
> > What type of storage volumes are you using to store your data?
> > Are you using headless services to connect your Solr Nodes, or ingresses?
> >
> > If this is the first time that you are using this data + Solr
> > configuration, maybe it's just that your data within Solr isn't optimized
> > for the type of queries that you are doing.
> > If you have run it successfully in the past outside of Kubernetes, then I
> > would look at the resources that you are giving your pods and the storage
> > volumes that you are using.
> > If you are using Ingresses, that might be causing slow connections
> between
> > nodes, or between your client and Solr.
> >
> > - Houston
> >
> > On Wed, Dec 9, 2020 at 3:24 PM Abhishek Mishra 
> > wrote:
> >
> > > Hello guys,
> > > We are kind of facing some of the issues(Like timeout etc.) which are
> > very
> > > inconsistent. By any chance can it be related to EKS? We are using solr
> > 7.7
> > > and zookeeper 3.4.13. Should we move to ECS?
> > >
> > > Regards,
> > > Abhishek
> > >
> >
>


Re: solrcloud with EKS kubernetes

2020-12-26 Thread Jonathan Tan
Hi Abhishek,

Merry Christmas to you too!
I think it's really a question regarding your indexing speed NFRs.

Have you had a chance to take a look at your IOPS & write bytes/second
graphs for that host & PVC?

I'd suggest that's the first thing to go look at, so that you can find out
whether you're actually IOPS bound or not.
If you are, then it becomes a question of *how* you're indexing, and
whether that can be "slowed down" or not.
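One relatively cheap lever before true rate-limiting is just to batch updates and lean on commitWithin instead of frequent explicit commits - a rough sketch only, with the collection name, fields and interval as placeholders:

```
# Send documents in larger batches and let Solr commit on an interval
curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycoll/update?commitWithin=60000" \
  -d '[{"id":"doc-1"},{"id":"doc-2"}]'
```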



On Thu, Dec 24, 2020 at 5:55 PM Abhishek Mishra 
wrote:

> Hi Jonathan,
> Merry Christmas.
> Thanks for the suggestion. To manage IOPS can we do something on
> rate-limiting behalf?
>
> Regards,
> Abhishek
>
>
> On Thu, Dec 17, 2020 at 5:07 AM Jonathan Tan  wrote:
>
> > Hi Abhishek,
> >
> > We're running Solr Cloud 8.6 on GKE.
> > 3 node cluster, running 4 cpus (configured) and 8gb of min & max JVM
> > configured, all with anti-affinity so they never exist on the same node.
> > It's got 2 collections of ~13documents each, 6 shards, 3 replicas each,
> > disk usage on each node is ~54gb (we've got all the shards replicated to
> > all nodes)
> >
> > We're also using a 200gb zonal SSD, which *has* been necessary just so
> that
> > we've got the right IOPS & bandwidth. (That's approximately 6000 IOPS for
> > read & write each, and 96MB/s for read & write each)
> >
> > Various lessons learnt...
> > You definitely don't want them ever on the same kubernetes node. From a
> > resilience perspective, yes, but also when one SOLR node gets busy, they
> > tend to all get busy, so now you'll have resource contention. Recovery
> can
> > also get very busy and resource intensive, and again, sitting on the same
> > node is problematic. We also saw the need to move to SSDs because of how
> > IOPS bound we were.
> >
> > Did I mention use SSDs? ;)
> >
> > Good luck!
> >
> > On Mon, Dec 14, 2020 at 5:34 PM Abhishek Mishra 
> > wrote:
> >
> > > Hi Houston,
> > > Sorry for the late reply. Each shard has a 9GB size around.
> > > Yeah, we are providing enough resources to pods. We are currently
> > > using c5.4xlarge.
> > > XMS and XMX is 16GB. The machine is having 32 GB and 16 core.
> > > No, I haven't run it outside Kubernetes. But I do have colleagues who
> did
> > > the same on 7.2 and didn't face any issue regarding it.
> > > Storage volume is gp2 50GB.
> > > It's not the search query where we are facing inconsistencies or
> > timeouts.
> > > Seems some internal admin APIs sometimes have issues. So while adding
> new
> > > replica in clusters sometimes result in inconsistencies. Like recovery
> > > takes some time more than one hour.
> > >
> > > Regards,
> > > Abhishek
> > >
> > > On Thu, Dec 10, 2020 at 10:23 AM Houston Putman <
> houstonput...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hello Abhishek,
> > > >
> > > > It's really hard to provide any advice without knowing any
> information
> > > > about your setup/usage.
> > > >
> > > > Are you giving your Solr pods enough resources on EKS?
> > > > Have you run Solr in the same configuration outside of kubernetes in
> > the
> > > > past without timeouts?
> > > > What type of storage volumes are you using to store your data?
> > > > Are you using headless services to connect your Solr Nodes, or
> > ingresses?
> > > >
> > > > If this is the first time that you are using this data + Solr
> > > > configuration, maybe it's just that your data within Solr isn't
> > optimized
> > > > for the type of queries that you are doing.
> > > > If you have run it successfully in the past outside of Kubernetes,
> > then I
> > > > would look at the resources that you are giving your pods and the
> > storage
> > > > volumes that you are using.
> > > > If you are using Ingresses, that might be causing slow connections
> > > between
> > > > nodes, or between your client and Solr.
> > > >
> > > > - Houston
> > > >
> > > > On Wed, Dec 9, 2020 at 3:24 PM Abhishek Mishra  >
> > > > wrote:
> > > >
> > > > > Hello guys,
> > > > > We are kind of facing some of the issues(Like timeout etc.) which
> are
> > > > very
> > > > > inconsistent. By any chance can it be related to EKS? We are using
> > solr
> > > > 7.7
> > > > > and zookeeper 3.4.13. Should we move to ECS?
> > > > >
> > > > > Regards,
> > > > > Abhishek
> > > > >
> > > >
> > >
> >
>