Yesterday, we upgraded our 40 node cluster from solr 7.6.0 to solr 7.7.0.  This morning, all the nodes are using 1200+% of CPU. It looks like it's in garbage collection.  We did reduce our HDFS cache size from 11G to 6G, but other than that, no other parameters were changes.

Top shows:

top - 09:33:40 up 14 days, 23:22,  1 user,  load average: 21.08, 20.97, 20.69
Tasks: 617 total,   1 running, 616 sleeping,   0 stopped,   0 zombie
Cpu(s): 60.9%us,  0.6%sy, 12.9%ni, 25.0%id,  0.6%wa,  0.0%hi, 0.1%si,  0.0%st
Mem:  99015716k total, 98627164k used,   388552k free,   187892k buffers
Swap: 33554428k total,   152900k used, 33401528k free, 17086000k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
39586 solr      20   0 42.0g  25g  13m S 1275.4 27.0  12935:57 java

Earlier yesterday when we first fired it up, it was taking int he 4 second range, now it's in the 60 second range:

 [Times: user=0.86 sys=0.02, real=0.06 secs]
2019-02-12T09:16:20.740-0500: 62300.935: Total time for which application threads were stopped: 0.0638071 seconds, Stopping threads took: 0.0003274 seconds 2019-02-12T09:16:21.216-0500: 62301.411: Total time for which application threads were stopped: 0.0011279 seconds, Stopping threads took: 0.0003174 seconds 2019-02-12T09:16:21.805-0500: 62301.999: Total time for which application threads were stopped: 0.0010905 seconds, Stopping threads took: 0.0003168 seconds 2019-02-12T09:16:22.175-0500: 62302.369: Total time for which application threads were stopped: 0.0011335 seconds, Stopping threads took: 0.0002635 seconds 2019-02-12T09:16:25.249-0500: 62305.444: Total time for which application threads were stopped: 0.0014435 seconds, Stopping threads took: 0.0004507 seconds 2019-02-12T09:16:26.254-0500: 62306.449: Total time for which application threads were stopped: 0.0011393 seconds, Stopping threads took: 0.0003034 seconds 2019-02-12T09:16:26.434-0500: 62306.629: Total time for which application threads were stopped: 0.0011275 seconds, Stopping threads took: 0.0002968 seconds 2019-02-12T09:16:55.540-0500: 62335.735: Total time for which application threads were stopped: 0.0012155 seconds, Stopping threads took: 0.0003023 seconds 2019-02-12T09:16:55.542-0500: 62335.737: Total time for which application threads were stopped: 0.0007474 seconds, Stopping threads took: 0.0002297 seconds 2019-02-12T09:16:55.543-0500: 62335.737: Total time for which application threads were stopped: 0.0005585 seconds, Stopping threads took: 0.0001380 seconds 2019-02-12T09:16:55.543-0500: 62335.738: Total time for which application threads were stopped: 0.0006552 seconds, Stopping threads took: 0.0001907 seconds 2019-02-12T09:17:00.553-0500: 62340.747: Total time for which application threads were stopped: 0.0011875 seconds, Stopping threads took: 0.0002851 seconds 2019-02-12T09:17:00.553-0500: 62340.748: Total time for which application threads were stopped: 0.0004486 seconds, Stopping threads took: 0.0000693 seconds 2019-02-12T09:17:00.554-0500: 62340.749: Total time for which application threads were stopped: 0.0006729 seconds, Stopping threads took: 0.0002604 seconds 2019-02-12T09:17:00.555-0500: 62340.749: Total time for which application threads were stopped: 0.0006151 seconds, Stopping threads took: 0.0002310 seconds 2019-02-12T09:17:00.556-0500: 62340.750: Total time for which application threads were stopped: 0.0006812 seconds, Stopping threads took: 0.0003049 seconds 2019-02-12T09:17:00.557-0500: 62340.752: Total time for which application threads were stopped: 0.0015586 seconds, Stopping threads took: 0.0002212 seconds 2019-02-12T09:17:09.901-0500: 62350.095: Total time for which application threads were stopped: 0.0013066 seconds, Stopping threads took: 0.0004304 seconds 2019-02-12T09:17:10.195-0500: 62350.390: Total time for which application threads were stopped: 0.0011006 seconds, Stopping threads took: 0.0002573 seconds
{Heap before GC invocations=791 (full 0):
 garbage-first heap   total 19922944K, used 17336353K [0x0000000240000000, 0x0000000241002600, 0x00000007c0000000)
  region size 16384K, 357 young (5849088K), 19 survivors (311296K)
 Metaspace       used 90791K, capacity 93476K, committed 93696K, reserved 1132544K   class space    used 10929K, capacity 11487K, committed 11520K, reserved 1048576K 2019-02-12T09:17:10.288-0500: 62350.482: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 377487360 bytes, new threshold 15 (max 15)
- age   1:  105613200 bytes,  105613200 total
- age   2:   57166352 bytes,  162779552 total
- age   3:   25396792 bytes,  188176344 total
- age   4:   17067152 bytes,  205243496 total
- age   5:   32136016 bytes,  237379512 total
- age   6:    9642888 bytes,  247022400 total
- age   7:    5979880 bytes,  253002280 total
- age   8:    7996344 bytes,  260998624 total
- age   9:    3940072 bytes,  264938696 total
- age  10:    9720808 bytes,  274659504 total
- age  11:    8797424 bytes,  283456928 total
- age  12:    2599280 bytes,  286056208 total
- age  13:    9197304 bytes,  295253512 total
- age  14:    2616704 bytes,  297870216 total
- age  15:    8565352 bytes,  306435568 total
, 0.0540664 secs]
   [Parallel Time: 47.3 ms, GC Workers: 16]
      [GC Worker Start (ms): Min: 62350482.7, Avg: 62350482.8, Max: 62350483.0, Diff: 0.3]       [Ext Root Scanning (ms): Min: 0.6, Avg: 0.9, Max: 2.6, Diff: 2.0, Sum: 14.7]
      [Update RS (ms): Min: 1.3, Avg: 2.8, Max: 3.1, Diff: 1.8, Sum: 44.2]
         [Processed Buffers: Min: 2, Avg: 9.8, Max: 26, Diff: 24, Sum: 157]
      [Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum: 1.1]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]       [Object Copy (ms): Min: 42.9, Avg: 43.0, Max: 43.1, Diff: 0.2, Sum: 688.3]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
         [Termination Attempts: Min: 1, Avg: 1.1, Max: 2, Diff: 1, Sum: 17]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum: 1.2]       [GC Worker Total (ms): Min: 46.6, Avg: 46.9, Max: 47.0, Diff: 0.4, Sum: 749.6]       [GC Worker End (ms): Min: 62350529.6, Avg: 62350529.7, Max: 62350529.8, Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.7 ms]
   [Other: 6.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 3.8 ms]
      [Ref Enq: 0.4 ms]
      [Redirty Cards: 0.3 ms]
      [Humongous Register: 0.2 ms]
      [Humongous Reclaim: 0.1 ms]
      [Free CSet: 0.7 ms]
   [Eden: 5408.0M(5408.0M)->0.0B(5376.0M) Survivors: 304.0M->320.0M Heap: 16.5G(19.0G)->11.3G(19.0G)]
Heap after GC invocations=792 (full 0):
 garbage-first heap   total 19922944K, used 11821182K [0x0000000240000000, 0x0000000241002600, 0x00000007c0000000)
  region size 16384K, 20 young (327680K), 20 survivors (327680K)
 Metaspace       used 90791K, capacity 93476K, committed 93696K, reserved 1132544K   class space    used 10929K, capacity 11487K, committed 11520K, reserved 1048576K
}
 [Times: user=0.77 sys=0.01, real=0.06 secs]

Parameters are:

GC_TUNE="-XX:+UseG1GC \
-XX:MaxDirectMemorySize=6g \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=16m \
-XX:MaxGCPauseMillis=300 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:ParallelGCThreads=16 \
-XX:-ResizePLAB \
-XX:+AggressiveOpts"


Anything I can try / change?

Thank you!

-Joe

Reply via email to