Yesterday, we upgraded our 40 node cluster from solr 7.6.0 to solr
7.7.0. This morning, all the nodes are using 1200+% of CPU. It looks
like it's in garbage collection. We did reduce our HDFS cache size from
11G to 6G, but other than that, no other parameters were changes.
Top shows:
top - 09:33:40 up 14 days, 23:22, 1 user, load average: 21.08, 20.97,
20.69
Tasks: 617 total, 1 running, 616 sleeping, 0 stopped, 0 zombie
Cpu(s): 60.9%us, 0.6%sy, 12.9%ni, 25.0%id, 0.6%wa, 0.0%hi, 0.1%si,
0.0%st
Mem: 99015716k total, 98627164k used, 388552k free, 187892k buffers
Swap: 33554428k total, 152900k used, 33401528k free, 17086000k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
39586 solr 20 0 42.0g 25g 13m S 1275.4 27.0 12935:57 java
Earlier yesterday when we first fired it up, it was taking int he 4
second range, now it's in the 60 second range:
[Times: user=0.86 sys=0.02, real=0.06 secs]
2019-02-12T09:16:20.740-0500: 62300.935: Total time for which
application threads were stopped: 0.0638071 seconds, Stopping threads
took: 0.0003274 seconds
2019-02-12T09:16:21.216-0500: 62301.411: Total time for which
application threads were stopped: 0.0011279 seconds, Stopping threads
took: 0.0003174 seconds
2019-02-12T09:16:21.805-0500: 62301.999: Total time for which
application threads were stopped: 0.0010905 seconds, Stopping threads
took: 0.0003168 seconds
2019-02-12T09:16:22.175-0500: 62302.369: Total time for which
application threads were stopped: 0.0011335 seconds, Stopping threads
took: 0.0002635 seconds
2019-02-12T09:16:25.249-0500: 62305.444: Total time for which
application threads were stopped: 0.0014435 seconds, Stopping threads
took: 0.0004507 seconds
2019-02-12T09:16:26.254-0500: 62306.449: Total time for which
application threads were stopped: 0.0011393 seconds, Stopping threads
took: 0.0003034 seconds
2019-02-12T09:16:26.434-0500: 62306.629: Total time for which
application threads were stopped: 0.0011275 seconds, Stopping threads
took: 0.0002968 seconds
2019-02-12T09:16:55.540-0500: 62335.735: Total time for which
application threads were stopped: 0.0012155 seconds, Stopping threads
took: 0.0003023 seconds
2019-02-12T09:16:55.542-0500: 62335.737: Total time for which
application threads were stopped: 0.0007474 seconds, Stopping threads
took: 0.0002297 seconds
2019-02-12T09:16:55.543-0500: 62335.737: Total time for which
application threads were stopped: 0.0005585 seconds, Stopping threads
took: 0.0001380 seconds
2019-02-12T09:16:55.543-0500: 62335.738: Total time for which
application threads were stopped: 0.0006552 seconds, Stopping threads
took: 0.0001907 seconds
2019-02-12T09:17:00.553-0500: 62340.747: Total time for which
application threads were stopped: 0.0011875 seconds, Stopping threads
took: 0.0002851 seconds
2019-02-12T09:17:00.553-0500: 62340.748: Total time for which
application threads were stopped: 0.0004486 seconds, Stopping threads
took: 0.0000693 seconds
2019-02-12T09:17:00.554-0500: 62340.749: Total time for which
application threads were stopped: 0.0006729 seconds, Stopping threads
took: 0.0002604 seconds
2019-02-12T09:17:00.555-0500: 62340.749: Total time for which
application threads were stopped: 0.0006151 seconds, Stopping threads
took: 0.0002310 seconds
2019-02-12T09:17:00.556-0500: 62340.750: Total time for which
application threads were stopped: 0.0006812 seconds, Stopping threads
took: 0.0003049 seconds
2019-02-12T09:17:00.557-0500: 62340.752: Total time for which
application threads were stopped: 0.0015586 seconds, Stopping threads
took: 0.0002212 seconds
2019-02-12T09:17:09.901-0500: 62350.095: Total time for which
application threads were stopped: 0.0013066 seconds, Stopping threads
took: 0.0004304 seconds
2019-02-12T09:17:10.195-0500: 62350.390: Total time for which
application threads were stopped: 0.0011006 seconds, Stopping threads
took: 0.0002573 seconds
{Heap before GC invocations=791 (full 0):
garbage-first heap total 19922944K, used 17336353K
[0x0000000240000000, 0x0000000241002600, 0x00000007c0000000)
region size 16384K, 357 young (5849088K), 19 survivors (311296K)
Metaspace used 90791K, capacity 93476K, committed 93696K,
reserved 1132544K
class space used 10929K, capacity 11487K, committed 11520K,
reserved 1048576K
2019-02-12T09:17:10.288-0500: 62350.482: [GC pause (G1 Evacuation Pause)
(young)
Desired survivor size 377487360 bytes, new threshold 15 (max 15)
- age 1: 105613200 bytes, 105613200 total
- age 2: 57166352 bytes, 162779552 total
- age 3: 25396792 bytes, 188176344 total
- age 4: 17067152 bytes, 205243496 total
- age 5: 32136016 bytes, 237379512 total
- age 6: 9642888 bytes, 247022400 total
- age 7: 5979880 bytes, 253002280 total
- age 8: 7996344 bytes, 260998624 total
- age 9: 3940072 bytes, 264938696 total
- age 10: 9720808 bytes, 274659504 total
- age 11: 8797424 bytes, 283456928 total
- age 12: 2599280 bytes, 286056208 total
- age 13: 9197304 bytes, 295253512 total
- age 14: 2616704 bytes, 297870216 total
- age 15: 8565352 bytes, 306435568 total
, 0.0540664 secs]
[Parallel Time: 47.3 ms, GC Workers: 16]
[GC Worker Start (ms): Min: 62350482.7, Avg: 62350482.8, Max:
62350483.0, Diff: 0.3]
[Ext Root Scanning (ms): Min: 0.6, Avg: 0.9, Max: 2.6, Diff: 2.0,
Sum: 14.7]
[Update RS (ms): Min: 1.3, Avg: 2.8, Max: 3.1, Diff: 1.8, Sum: 44.2]
[Processed Buffers: Min: 2, Avg: 9.8, Max: 26, Diff: 24, Sum: 157]
[Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum: 1.1]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff:
0.0, Sum: 0.0]
[Object Copy (ms): Min: 42.9, Avg: 43.0, Max: 43.1, Diff: 0.2,
Sum: 688.3]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Termination Attempts: Min: 1, Avg: 1.1, Max: 2, Diff: 1, Sum: 17]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1,
Sum: 1.2]
[GC Worker Total (ms): Min: 46.6, Avg: 46.9, Max: 47.0, Diff:
0.4, Sum: 749.6]
[GC Worker End (ms): Min: 62350529.6, Avg: 62350529.7, Max:
62350529.8, Diff: 0.1]
[Code Root Fixup: 0.0 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.7 ms]
[Other: 6.0 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 3.8 ms]
[Ref Enq: 0.4 ms]
[Redirty Cards: 0.3 ms]
[Humongous Register: 0.2 ms]
[Humongous Reclaim: 0.1 ms]
[Free CSet: 0.7 ms]
[Eden: 5408.0M(5408.0M)->0.0B(5376.0M) Survivors: 304.0M->320.0M
Heap: 16.5G(19.0G)->11.3G(19.0G)]
Heap after GC invocations=792 (full 0):
garbage-first heap total 19922944K, used 11821182K
[0x0000000240000000, 0x0000000241002600, 0x00000007c0000000)
region size 16384K, 20 young (327680K), 20 survivors (327680K)
Metaspace used 90791K, capacity 93476K, committed 93696K,
reserved 1132544K
class space used 10929K, capacity 11487K, committed 11520K,
reserved 1048576K
}
[Times: user=0.77 sys=0.01, real=0.06 secs]
Parameters are:
GC_TUNE="-XX:+UseG1GC \
-XX:MaxDirectMemorySize=6g \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=16m \
-XX:MaxGCPauseMillis=300 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:ParallelGCThreads=16 \
-XX:-ResizePLAB \
-XX:+AggressiveOpts"
Anything I can try / change?
Thank you!
-Joe