Erick and Shawn,
thank you very much for the very useful information.

When I started to move from single Solr to SolrCloud, I was planning to use
the cluster for very large collections.

But the collection I mentioned will not grow that much, so I will reduce the
number of shards.


Thanks for the information about load balancing. I will put one in place.
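
For reference, this is roughly what I have in mind for the load balancer (a
minimal haproxy sketch, not a finished config; the hostnames and the listen
port are placeholders for my two boxes):

--------------
defaults
    mode http
    timeout connect 5s
    timeout client  60s
    timeout server  60s

# Listen on the load balancer box and spread queries over both Solr boxes.
frontend solr_front
    bind *:8080
    default_backend solr_back

backend solr_back
    balance roundrobin
    # Health check so a dead instance is taken out of rotation.
    option httpchk GET /solr/admin/info/system
    server solr1 solr1.example.com:8983 check
    server solr2 solr2.example.com:8983 check
--------------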



Shawn, below is the information that I hope will clarify things.

Linux CentOS
Solr 6.6
64 GB RAM per box
6 GB heap per Solr node

The last time the node died was on 2019-08-11. It happens a few times a week.


--------------
tail -f  node1/logs/solr_oom_killer-8983-2019-08-11_22_57_56.log
Running OOM killer script for process 38788 for Solr on port 8983
Killed process 38788
--------------
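
Shawn, if it would help, I can also pull the OutOfMemoryError itself out of
solr.log with something like this (the path follows my layout above):

--------------
grep -B 2 -A 10 "OutOfMemoryError" node1/logs/solr.log*
--------------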


--------------
ls -ltr  node1/logs/archived/
total 82032
-rw-rw-r-- 1 solr solr 20973030 Aug  4 18:31 solr_gc.log.0
-rw-rw-r-- 1 solr solr 20973415 Aug  6 21:05 solr_gc.log.1
-rw-rw-r-- 1 solr solr 20971714 Aug  9 12:01 solr_gc.log.2
-rw-rw-r-- 1 solr solr 20971720 Aug 11 22:53 solr_gc.log.3
-rw-rw-r-- 1 solr solr    77096 Aug 11 22:57 solr_gc.log.4.current
-rw-rw-r-- 1 solr solr      364 Aug 11 22:57 solr-8983-console.log
--------------


--------------
tail -50   node1/logs/archived/solr_gc.log.4.current
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
}
2019-08-11T22:57:39.231-0300: 802516.887: Total time for which application threads were stopped: 12.5386815 seconds, Stopping threads took: 0.0001242 seconds
{Heap before GC invocations=34291 (full 252):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K, 100% used [0x0000000640000000, 0x0000000680000000, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069ffffff8, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718592K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
2019-08-11T22:57:39.233-0300: 802516.889: [Full GC (Allocation Failure) 2019-08-11T22:57:39.233-0300: 802516.889: [CMS: 4718592K->4718591K(4718592K), 5.5779385 secs] 6029311K->6029311K(6029312K), [Metaspace: 50496K->50496K(1097728K)], 5.5780863 secs] [Times: user=5.58 sys=0.00, real=5.58 secs]
Heap after GC invocations=34292 (full 253):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K,  99% used [0x0000000640000000, 0x000000067fffff68, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff18, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
}
2019-08-11T22:57:44.812-0300: 802522.469: Total time for which application threads were stopped: 5.5805500 seconds, Stopping threads took: 0.0001295 seconds
{Heap before GC invocations=34292 (full 253):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K, 100% used [0x0000000640000000, 0x0000000680000000, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff98, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
2019-08-11T22:57:44.813-0300: 802522.470: [Full GC (Allocation Failure) 2019-08-11T22:57:44.813-0300: 802522.470: [CMS: 4718591K->4718591K(4718592K), 5.5944800 secs] 6029311K->6029311K(6029312K), [Metaspace: 50496K->50496K(1097728K)], 5.5946363 secs] [Times: user=5.60 sys=0.00, real=5.59 secs]
Heap after GC invocations=34293 (full 254):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K,  99% used [0x0000000640000000, 0x000000067fffffe8, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff98, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
}
{Heap before GC invocations=34293 (full 254):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K,  99% used [0x0000000640000000, 0x000000067fffffe8, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff98, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
2019-08-11T22:57:50.408-0300: 802528.065: [Full GC (Allocation Failure) 2019-08-11T22:57:50.408-0300: 802528.065: [CMS: 4718591K->4718591K(4718592K), 5.5953203 secs] 6029311K->6029311K(6029312K), [Metaspace: 50496K->50496K(1097728K)], 5.5954659 secs] [Times: user=5.60 sys=0.00, real=5.60 secs]
--------------
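
From what I can read in this log, the CMS old generation is essentially full
(4718591K of 4718592K) and the back-to-back Full GCs reclaim nothing, so it
looks like the heap itself was exhausted. If the answer ends up being a larger
heap, I assume the change would go in solr.in.sh, something like this (the
value is only an example, not a recommendation for a specific size):

--------------
# solr.in.sh -- raise the heap for each Solr node (example value only)
SOLR_HEAP="8g"
--------------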

On Mon, Aug 12, 2019 at 13:26, Shawn Heisey <apa...@elyograg.org> wrote:

> On 8/12/2019 5:47 AM, Kojo wrote:
> > I am using Solr cloud with this configuration:
> >
> > 2 boxes (one Solr in each box)
> > 4 instances per box
>
> Why are you running multiple instances on one server?  For most setups,
> this has too much overhead.  A single instance can handle many indexes.
> The only good reason I can think of to run multiple instances is when
> the amount of heap memory needed exceeds 31GB.  And even then, four
> instances seems excessive.  If you only have 300000 documents, there
> should be no reason for a super large heap.
>
> > At this moment I have an active collection with about 300,000 docs. The
> > other collections are not being queried. The active collection is
> > configured:
> > - shards: 16
> > - replication factor: 2
> >
> > These two Solrs (Solr1 and Solr2) use ZooKeeper (one box, one instance, no
> > ZooKeeper cluster)
> >
> > My application points to Solr1, and everything works fine until suddenly
> > one instance of this Solr1 dies. This instance is on port 8983, the
> > "main" instance. I thought it could be related to memory usage, but we
> > increased RAM and JVM memory and it still dies.
> > Solr1, the one which dies, is the destination where I point my web
> > application.
>
> You will have to check the logs.  If Solr is not running on Windows,
> then any OutOfMemoryError exception, which can be caused by things other
> than a memory shortage, will result in Solr terminating itself.  On
> Windows, that functionality does not yet exist, so it would have to be
> Java or the OS that kills it.
>
> > Here I have two questions that I hope you can help me with:
> >
> > 1. Which log can I look at to debug this issue?
>
> Assuming you're NOT on Windows, check to see if there is a logfile named
> solr_oom_killer-8983.log in the logs directory where solr.log lives.  If
> there is, then that means the oom killer script was executed, and that
> happens when there is an OutOfMemoryError thrown.  The solr.log file
> MIGHT contain the OOME exception which will tell you what system
> resource was depleted.  If it was not heap memory that was depleted,
> then increasing memory probably won't help.
>
> If you share the gc log that Solr writes, we can analyze this to see if
> it was heap memory that was depleted.
>
> > 2. After this instance dies, the Solr cloud does not answer my web
> > application. Is this correct? I thought that the replicas should answer
> > if one shard, instance, or box goes down.
>
> If a Solr instance dies, you can't make connections directly to it.
> Connections would need to go to another instance.  You need a load
> balancer to handle that automatically, or a cloud-aware client.  The
> only cloud-aware client that I am sure about is the one for Java -- it
> is named SolrJ, created by the Solr project and distributed with Solr.
> I think that a third party MIGHT have written a cloud-aware client for
> Python, but I am not sure about this.
>
> If you set up a load balancer, you will need to handle redundancy for that.
>
> Side note:  A fully redundant zookeeper install needs three servers.  Do
> not put a load balancer in front of zookeeper.  The ZK protocol handles
> redundancy itself and a load balancer will break that.
>
> Thanks.
> Shawn
>
