Thank you. I checked the memory footprint: we have InitiatingHeapOccupancyPercent set to 75, and heap occupancy sits at about 76%.
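For reference, this is roughly how I am sampling heap and GC on a live node now (a minimal sketch; the 'start.jar' pattern is an assumption based on the default bin/solr launcher, so adjust it to however your nodes are started):

    # find the Solr JVM and sample heap/GC every 5 seconds
    SOLR_PID=$(jps -l | awk '/start.jar/ {print $1}')  # 'start.jar' assumed; match your launcher
    jstat -gcutil "$SOLR_PID" 5000
    # O = old-gen occupancy in percent; with IHOP=75, hovering at 75-80%
    # means G1 keeps triggering concurrent marking cycles back to back

This matches the GC log quoted below, where the heap stays around 24-25GB of 32GB even after collections.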
The other point: our ZooKeeper ensemble does not run on dedicated servers; perhaps that is what is causing the instability.
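Before we move ZooKeeper to dedicated machines, I will also check the ensemble itself with the four-letter commands (a sketch; the host and port are placeholders for our setup):

    # 'ruok' should answer 'imok'; 'mntr' reports request latency and
    # outstanding requests, which should show whether ZK is overloaded
    echo ruok | nc zk1.example.com 2181
    echo mntr | nc zk1.example.com 2181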
What else do you recommend I check?

2018-02-27 22:37 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:

> This does not show much: only that your heap is around 75% (24-25GB). I
> was thinking that you should compare metrics (heap/GC as well) when
> running without issues and when running with issues, and see if
> something can be concluded.
> About instability: do you run ZK on dedicated nodes?
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>> On 27 Feb 2018, at 14:43, 苗海泉 <mseaspr...@gmail.com> wrote:
>>
>> Thank you. We originally ran 49 shards on 49 nodes, but we found that
>> in that setup Solr often disconnected from ZooKeeper; with so many
>> nodes registered, ZooKeeper made Solr unstable, so we reduced it to 25.
>> If indexing performance later cannot keep up, we will need to increase
>> it again.
>>
>> When it is very slow, neither Solr nor ZooKeeper shows any errors;
>> index building is simply slow. The log shows that automatic commits are
>> slow, but the main cause may not lie in the commit itself.
>>
>> I am sorry, I do not know how to check the utilization of the Java heap
>> directly. Judging from the GC log, GC times are not long. Here is the
>> log:
>>
>> {Heap before GC invocations=1144021 (full 72):
>>  garbage-first heap   total 33554432K, used 26982419K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>   region size 8192K, 204 young (1671168K), 26 survivors (212992K)
>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>> 2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation Pause) (young)
>> Desired survivor size 109051904 bytes, new threshold 1 (max 15)
>> - age   1:  113878760 bytes,  113878760 total
>> - age   2:   21264744 bytes,  135143504 total
>> - age   3:   17020096 bytes,  152163600 total
>> - age   4:   26870864 bytes,  179034464 total
>> , 0.0579794 secs]
>>    [Parallel Time: 46.9 ms, GC Workers: 18]
>>       [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max: 4668016046.4, Diff: 0.3]
>>       [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9, Sum: 116.9]
>>       [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum: 62.0]
>>          [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum: 113]
>>       [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
>>       [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum: 428.1]
>>       [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum: 228.9]
>>          [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum: 1.2]
>>       [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3, Sum: 838.0]
>>       [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max: 4668016092.8, Diff: 0.0]
>>    [Code Root Fixup: 0.2 ms]
>>    [Code Root Purge: 0.0 ms]
>>    [Clear CT: 0.3 ms]
>>    [Other: 10.7 ms]
>>       [Choose CSet: 0.0 ms]
>>       [Ref Proc: 5.9 ms]
>>       [Ref Enq: 0.2 ms]
>>       [Redirty Cards: 0.2 ms]
>>       [Humongous Register: 2.2 ms]
>>       [Humongous Reclaim: 0.4 ms]
>>       [Free CSet: 0.4 ms]
>>    [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap: 25.7G(32.0G)->24.3G(32.0G)]
>> Heap after GC invocations=1144022 (full 72):
>>  garbage-first heap   total 33554432K, used 25489656K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>   region size 8192K, 10 young (81920K), 10 survivors (81920K)
>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>> }
>>  [Times: user=0.84 sys=0.01, real=0.05 secs]
>> 2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which application threads were stopped: 0.0661383 seconds, Stopping threads took: 0.0004141 seconds
>> 2018-02-27T21:43:02.092+0800: 4668016.343: [GC concurrent-mark-end, 2.5757061 secs]
>> 2018-02-27T21:43:02.100+0800: 4668016.351: [GC remark 2018-02-27T21:43:02.100+0800: 4668016.351: [Finalize Marking, 0.0016508 secs] 2018-02-27T21:43:02.102+0800: 4668016.352: [GC ref-proc, 0.0277818 secs] 2018-02-27T21:43:02.129+0800: 4668016.380: [Unloading, 0.0118102 secs], 0.0704296 secs]
>>  [Times: user=0.85 sys=0.04, real=0.07 secs]
>> 2018-02-27T21:43:02.171+0800: 4668016.422: Total time for which application threads were stopped: 0.0785762 seconds, Stopping threads took: 0.0006159 seconds
>> 2018-02-27T21:43:02.178+0800: 4668016.429: [GC cleanup 24G->24G(32G), 0.0391915 secs]
>>  [Times: user=0.64 sys=0.00, real=0.04 secs]
>> 2018-02-27T21:43:02.218+0800: 4668016.469: Total time for which application threads were stopped: 0.0470020 seconds, Stopping threads took: 0.0001684 seconds
>> 2018-02-27T21:43:02.540+0800: 4668016.791: Total time for which application threads were stopped: 0.0074829 seconds, Stopping threads took: 0.0004834 seconds
>> {Heap before GC invocations=1144023 (full 72):
>>  garbage-first heap   total 33554432K, used 27078904K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>   region size 8192K, 204 young (1671168K), 10 survivors (81920K)
>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>> 2018-02-27T21:43:04.076+0800: 4668018.326: [GC pause (G1 Evacuation Pause) (young)
>> Desired survivor size 109051904 bytes, new threshold 15 (max 15)
>> - age   1:   47719032 bytes,   47719032 total
>> , 0.0554183 secs]
>>    [Parallel Time: 48.0 ms, GC Workers: 18]
>>       [GC Worker Start (ms): Min: 4668018329.0, Avg: 4668018329.1, Max: 4668018329.3, Diff: 0.3]
>>       [Ext Root Scanning (ms): Min: 2.9, Avg: 5.7, Max: 47.4, Diff: 44.6, Sum: 103.0]
>>       [Update RS (ms): Min: 0.0, Avg: 14.3, Max: 16.2, Diff: 16.2, Sum: 257.6]
>>          [Processed Buffers: Min: 0, Avg: 17.4, Max: 22, Diff: 22, Sum: 314]
>>       [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
>>       [Object Copy (ms): Min: 0.1, Avg: 10.9, Max: 11.9, Diff: 11.8, Sum: 196.9]
>>       [Termination (ms): Min: 0.0, Avg: 16.6, Max: 17.6, Diff: 17.6, Sum: 299.1]
>>          [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.5]
>>       [GC Worker Total (ms): Min: 47.5, Avg: 47.6, Max: 47.8, Diff: 0.3, Sum: 857.6]
>>       [GC Worker End (ms): Min: 4668018376.7, Avg: 4668018376.8, Max: 4668018376.8, Diff: 0.0]
>>    [Code Root Fixup: 0.2 ms]
>>    [Code Root Purge: 0.0 ms]
>>    [Clear CT: 0.2 ms]
>>    [Other: 7.1 ms]
>>       [Choose CSet: 0.0 ms]
>>       [Ref Proc: 2.3 ms]
>>       [Ref Enq: 0.2 ms]
>>       [Redirty Cards: 0.2 ms]
>>       [Humongous Register: 2.2 ms]
>>       [Humongous Reclaim: 0.4 ms]
>>       [Free CSet: 0.4 ms]
>>    [Eden: 1552.0M(1552.0M)->0.0B(1488.0M) Survivors: 80.0M->144.0M Heap: 25.8G(32.0G)->24.4G(32.0G)]
>> Heap after GC invocations=1144024 (full 72):
>>  garbage-first heap   total 33554432K, used 25550050K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>   region size 8192K, 18 young (147456K), 18 survivors (147456K)
>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>> }
>>  [Times: user=0.82 sys=0.00, real=0.05 secs]
>> 2018-02-27 20:58 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>
>>> Ah, so there are ~560 shards per node and not all nodes are indexing
>>> at the same time. Why is that? You can have better throughput if
>>> indexing on all nodes. If happy with the shard size, you can create a
>>> new collection with 49 shards every 2h and have everything the same
>>> and index on all nodes.
>>>
>>> Back to the main question: what is the heap utilisation? When you
>>> restart a node, what is the heap utilisation? Do you see any errors in
>>> your logs? Do you see any errors in the ZK logs?
>>>
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>> On 27 Feb 2018, at 13:22, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>
>>>> Thanks for your reply again.
>>>> There may be some misunderstanding: we have 49 Solr nodes, each
>>>> collection has 25 shards, and each shard has only a single replica of
>>>> the data (no extra copies). I have also reduced some of the caches.
>>>> If you need the metric data, I can pull it out for you. In addition,
>>>> ours is an append-only system; there are never any update operations.
>>>>
>>>> 2018-02-27 20:05 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>>
>>>>> Hi,
>>>>> It is hard to tell without looking more into your metrics. It seems
>>>>> to me that you are reaching the limits of your cluster. I would
>>>>> double-check if memory is the issue. If I got it right, you have
>>>>> ~1120 shards per node. It takes some heap just to keep them open. If
>>>>> you have some caches enabled and it is an append-only system, old
>>>>> shards will keep caches until reloaded.
>>>>> Probably will not make much difference, but with 25x2=50 shards and
>>>>> 49 nodes, one node will need to handle double the indexing load.
>>>>>
>>>>> Emir
>>>>> --
>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>
>>>>>> On 27 Feb 2018, at 12:54, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>>
>>>>>> In addition, we found that the indexing rate was normal as long as
>>>>>> the number of collections stayed below 936, and it became slower and
>>>>>> slower at around 984. So far we could only work around this by
>>>>>> temporarily deleting older collections, but now we need to keep more
>>>>>> collections online. This has puzzled us for a long time with no good
>>>>>> way out, so any ideas on how to approach the problem would be
>>>>>> greatly appreciated.
>>>>>>
>>>>>> 2018-02-27 19:46 GMT+08:00 苗海泉 <mseaspr...@gmail.com>:
>>>>>>
>>>>>>> Thank you for the reply.
>>>>>>> Each collection has 25 shards with one replica, and one Solr node
>>>>>>> holds about 5TB on disk.
>>>>>>> GC has been checked, and we modified it as follows:
>>>>>>> SOLR_JAVA_MEM="-Xms32768m -Xmx32768m"
>>>>>>> GC_TUNE=" \
>>>>>>> -XX:+UseG1GC \
>>>>>>> -XX:+PerfDisableSharedMem \
>>>>>>> -XX:+ParallelRefProcEnabled \
>>>>>>> -XX:G1HeapRegionSize=8m \
>>>>>>> -XX:MaxGCPauseMillis=250 \
>>>>>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>>>>>> -XX:+UseLargePages \
>>>>>>> -XX:+AggressiveOpts"
>>>>>>>
>>>>>>> 2018-02-27 19:27 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> To get a more complete picture, can you tell us how many
>>>>>>>> shards/replicas you have per collection? Also, what is the index
>>>>>>>> size on disk? Did you check GC?
>>>>>>>>
>>>>>>>> BTW, using a 32GB heap prevents you from using compressed oops,
>>>>>>>> resulting in less memory available than with 31GB.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Emir
>>>>>>>> --
>>>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>>>>
>>>>>>>>> On 27 Feb 2018, at 11:36, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I have run into a serious problem in our use of Solr. We are on
>>>>>>>>> Solr 6.0, our daily data volume is about 500 billion documents,
>>>>>>>>> we create a new collection every hour, we have more than a
>>>>>>>>> thousand collections online, and we run 49 Solr nodes. With fewer
>>>>>>>>> than 800 collections, indexing is still very fast; at around 1100
>>>>>>>>> collections, Solr indexing throughput drops sharply, from a
>>>>>>>>> program speed of about 2-3 million TPS down to only a few hundred
>>>>>>>>> or even a few dozen TPS. Has anyone encountered a similar
>>>>>>>>> situation? We have found no good lead on this issue. By the way,
>>>>>>>>> each Solr node is assigned 32GB of memory, and we checked memory,
>>>>>>>>> CPU, disk I/O, and network I/O usage; all of them are normal. If
>>>>>>>>> anyone has run into a similar problem, please share your
>>>>>>>>> solution. Thank you very much.
>>>>>>>
>>>>>>> --
>>>>>>> ==============================
>>>>>>> 联创科技
>>>>>>> 知行如一
>>>>>>> ==============================
>>>>>>
>>>>>> --
>>>>>> ==============================
>>>>>> 联创科技
>>>>>> 知行如一
>>>>>> ==============================
>>>>
>>>> --
>>>> ==============================
>>>> 联创科技
>>>> 知行如一
>>>> ==============================
>>
>> --
>> ==============================
>> 联创科技
>> 知行如一
>> ==============================
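P.S. Following up on Emir's earlier note that a 32GB heap disables compressed oops: this is how I am verifying it (a sketch; it just starts a throwaway JVM with our Solr heap settings and prints the resolved flag):

    # with -Xmx32768m this prints 'false', i.e. compressed oops are off;
    # dropping to -Xmx31g should print 'true' and free up some headroom
    java -Xms32768m -Xmx32768m -XX:+PrintFlagsFinal -version | grep 'UseCompressedOops'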
--
==============================
联创科技
知行如一
==============================