Erick and Shawn, thank you very much for the very useful information. When I started to move from single Solr to SolrCloud, I was planning to use the cluster for very large collections.
But the collection I mentioned will not grow that much, so I will downsize the shards. Thanks for the information about load balancing; I will set it up.

Shawn, below is the information that I hope will clarify things:

Linux CentOS
Solr 6.6
64 GB RAM in each box
6 GB heap for each node

The last time the node died was on 2019-08-11. It happens a few times a week.

--------------
tail -f node1/logs/solr_oom_killer-8983-2019-08-11_22_57_56.log

Running OOM killer script for process 38788 for Solr on port 8983
Killed process 38788
--------------

--------------
ls -ltr node1/logs/archived/
total 82032
-rw-rw-r-- 1 solr solr 20973030 Aug  4 18:31 solr_gc.log.0
-rw-rw-r-- 1 solr solr 20973415 Aug  6 21:05 solr_gc.log.1
-rw-rw-r-- 1 solr solr 20971714 Aug  9 12:01 solr_gc.log.2
-rw-rw-r-- 1 solr solr 20971720 Aug 11 22:53 solr_gc.log.3
-rw-rw-r-- 1 solr solr    77096 Aug 11 22:57 solr_gc.log.4.current
-rw-rw-r-- 1 solr solr      364 Aug 11 22:57 solr-8983-console.log
--------------

--------------
tail -50 node1/logs/archived/solr_gc.log.4.current

 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
}
2019-08-11T22:57:39.231-0300: 802516.887: Total time for which application threads were stopped: 12.5386815 seconds, Stopping threads took: 0.0001242 seconds
{Heap before GC invocations=34291 (full 252):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K, 100% used [0x0000000640000000, 0x0000000680000000, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069ffffff8, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718592K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
2019-08-11T22:57:39.233-0300: 802516.889: [Full GC (Allocation Failure) 2019-08-11T22:57:39.233-0300: 802516.889: [CMS: 4718592K->4718591K(4718592K), 5.5779385 secs] 6029311K->6029311K(6029312K), [Metaspace: 50496K->50496K(1097728K)], 5.5780863 secs] [Times: user=5.58 sys=0.00, real=5.58 secs]
Heap after GC invocations=34292 (full 253):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K,  99% used [0x0000000640000000, 0x000000067fffff68, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff18, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
}
2019-08-11T22:57:44.812-0300: 802522.469: Total time for which application threads were stopped: 5.5805500 seconds, Stopping threads took: 0.0001295 seconds
{Heap before GC invocations=34292 (full 253):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K, 100% used [0x0000000640000000, 0x0000000680000000, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff98, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
2019-08-11T22:57:44.813-0300: 802522.470: [Full GC (Allocation Failure) 2019-08-11T22:57:44.813-0300: 802522.470: [CMS: 4718591K->4718591K(4718592K), 5.5944800 secs] 6029311K->6029311K(6029312K), [Metaspace: 50496K->50496K(1097728K)], 5.5946363 secs] [Times: user=5.60 sys=0.00, real=5.59 secs]
Heap after GC invocations=34293 (full 254):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K,  99% used [0x0000000640000000, 0x000000067fffffe8, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff98, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
}
{Heap before GC invocations=34293 (full 254):
 par new generation   total 1310720K, used 1310719K [0x0000000640000000, 0x00000006a0000000, 0x00000006a0000000)
  eden space 1048576K,  99% used [0x0000000640000000, 0x000000067fffffe8, 0x0000000680000000)
  from space 262144K,  99% used [0x0000000690000000, 0x000000069fffff98, 0x00000006a0000000)
  to   space 262144K,   0% used [0x0000000680000000, 0x0000000680000000, 0x0000000690000000)
 concurrent mark-sweep generation total 4718592K, used 4718591K [0x00000006a0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 50496K, capacity 51788K, committed 53140K, reserved 1097728K
  class space    used 5001K, capacity 5263K, committed 5524K, reserved 1048576K
2019-08-11T22:57:50.408-0300: 802528.065: [Full GC (Allocation Failure) 2019-08-11T22:57:50.408-0300: 802528.065: [CMS: 4718591K->4718591K(4718592K), 5.5953203 secs] 6029311K->6029311K(6029312K), [Metaspace: 50496K->50496K(1097728K)], 5.5954659 secs] [Times: user=5.60 sys=0.00, real=5.60 secs]
--------------

On Mon, Aug 12, 2019 at 1:26 PM Shawn Heisey <apa...@elyograg.org>
wrote:

> On 8/12/2019 5:47 AM, Kojo
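(As an aside, the log above shows back-to-back Full GCs in which the CMS old generation stays pinned at 4718591K of 4718592K and each collection frees almost nothing, which is the signature of an exhausted heap rather than a one-off spike. A rough sketch of how one might tally such events from a CMS `-XX:+PrintGCDetails` log; the regex and field names here are my own assumptions about that log format, not anything from the thread:)

```python
import re

# Matches standard CMS Full GC entries such as:
# [Full GC (Allocation Failure) ... [CMS: 4718592K->4718591K(4718592K), 5.5779385 secs] ...
FULL_GC = re.compile(
    r"\[Full GC \((?P<cause>[^)]+)\).*?"
    r"\[CMS: (?P<before>\d+)K->(?P<after>\d+)K\((?P<cap>\d+)K\), (?P<secs>[\d.]+) secs\]"
)

def summarize_full_gcs(log_text):
    """Return (count, total_pause_secs, kb_freed_per_gc) for Full GC events."""
    count, total, freed = 0, 0.0, []
    for m in FULL_GC.finditer(log_text):
        count += 1
        total += float(m.group("secs"))
        freed.append(int(m.group("before")) - int(m.group("after")))
    return count, total, freed

# One line taken from the log above, flattened for the example.
sample = (
    "2019-08-11T22:57:39.233-0300: 802516.889: [Full GC (Allocation Failure) "
    "2019-08-11T22:57:39.233-0300: 802516.889: [CMS: 4718592K->4718591K(4718592K), "
    "5.5779385 secs] 6029311K->6029311K(6029312K), "
    "[Metaspace: 50496K->50496K(1097728K)], 5.5780863 secs]"
)
count, total, freed = summarize_full_gcs(sample)
print(count, round(total, 2), freed)  # a 5.58 s Full GC that freed only 1 KB: heap exhausted
```

Seeing hundreds of entries like this with near-zero KB freed means the heap is simply too small for the load, so only a bigger heap or a smaller footprint (fewer shards, fewer instances) will help.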
wrote:
> > I am using Solr cloud on this configuration:
> >
> > 2 boxes (one Solr in each box)
> > 4 instances per box
>
> Why are you running multiple instances on one server?  For most setups,
> this has too much overhead.  A single instance can handle many indexes.
> The only good reason I can think of to run multiple instances is when
> the amount of heap memory needed exceeds 31GB.  And even then, four
> instances seems excessive.  If you only have 300000 documents, there
> should be no reason for a super large heap.
>
> > At this moment I have an active collection with about 300,000 docs. The
> > other collections are not being queried. The active collection is
> > configured:
> > - shards: 16
> > - replication factor: 2
> >
> > These two Solrs (Solr1 and Solr2) use ZooKeeper (one box, one
> > instance. No ZooKeeper cluster.)
> >
> > My application points to Solr1, and everything works fine, until
> > suddenly one instance of this Solr1 dies. This instance is on port
> > 8983, the "main" instance. I thought it could be related to memory
> > usage, but we increased RAM and JVM memory and it still dies.
> > Solr1, the one which dies, is the destination where I point my web
> > application.
>
> You will have to check the logs.  If Solr is not running on Windows,
> then any OutOfMemoryError exception, which can be caused by things other
> than a memory shortage, will result in Solr terminating itself.  On
> Windows, that functionality does not yet exist, so it would have to be
> Java or the OS that kills it.
>
> > Here I have two questions that I hope you can help me with:
> >
> > 1. Which log can I look at to debug this issue?
>
> Assuming you're NOT on Windows, check to see if there is a logfile named
> solr_oom_killer-8983.log in the logs directory where solr.log lives.  If
> there is, then that means the oom killer script was executed, and that
> happens when there is an OutOfMemoryError thrown.  The solr.log file
> MIGHT contain the OOME exception which will tell you what system
> resource was depleted.  If it was not heap memory that was depleted,
> then increasing memory probably won't help.
>
> If you share the gc log that Solr writes, we can analyze this to see if
> it was heap memory that was depleted.
>
> > 2. After this instance dies, the Solr cloud does not answer my web
> > application. Is this correct? I thought that the replicas should
> > answer if one shard, instance, or one box goes down.
>
> If a Solr instance dies, you can't make connections directly to it.
> Connections would need to go to another instance.  You need a load
> balancer to handle that automatically, or a cloud-aware client.  The
> only cloud-aware client that I am sure about is the one for Java -- it
> is named SolrJ, created by the Solr project and distributed with Solr.
> I think that a third party MIGHT have written a cloud-aware client for
> Python, but I am not sure about this.
>
> If you set up a load balancer, you will need to handle redundancy for
> that.
>
> Side note: A fully redundant zookeeper install needs three servers.  Do
> not put a load balancer in front of zookeeper.  The ZK protocol handles
> redundancy itself and a load balancer will break that.
>
> Thanks,
> Shawn
>
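(A minimal sketch of the failover idea Shawn describes, assuming hypothetical node URLs and a plain HTTP health check. This is only an illustration of the "client must pick a live node" logic; a real cloud-aware client such as SolrJ's CloudSolrClient instead watches ZooKeeper for the live-node list:)

```python
import urllib.request
import urllib.error

# Hypothetical node list; in a real cluster this state would come from ZooKeeper.
NODES = ["http://solr1:8983/solr", "http://solr2:8983/solr"]

def first_live_node(nodes, is_alive):
    """Return the first node that passes the health check, or None if all are down."""
    for node in nodes:
        if is_alive(node):
            return node
    return None

def ping(base_url, timeout=2):
    """Health check against Solr's system-info handler; False on any failure."""
    try:
        with urllib.request.urlopen(base_url + "/admin/info/system", timeout=timeout) as r:
            return r.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Example with canned statuses: pretend solr1 is down and solr2 is up.
statuses = {"http://solr1:8983/solr": False, "http://solr2:8983/solr": True}
print(first_live_node(NODES, statuses.get))  # falls over to solr2
```

In production the same job is better done by a dedicated load balancer (itself made redundant, per Shawn's note) rather than ad-hoc client code, but the sketch shows why requests keep working only if something routes them away from the dead instance.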