Hi
  Our system still goes down as time goes on. We found that lots of threads
are in the WAITING state. Here is the thread dump that I copied from the web
page, along with 4 pictures of it.
  Could this be related to my problem?


https://www.dropbox.com/s/h3wyez091oouwck/threaddump?dl=0
https://www.dropbox.com/s/p3ctuxb3t1jgo2e/threaddump1.jpg?dl=0
https://www.dropbox.com/s/w0uy15h6z984ntw/threaddump2.jpg?dl=0
https://www.dropbox.com/s/0frskxdllxlz9ha/threaddump3.jpg?dl=0
https://www.dropbox.com/s/46ptnly1ngi9nb6/threaddump4.jpg?dl=0
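
For anyone who prefers numbers to screenshots, the WAITING count could be
tallied with a small shell sketch like the one below. It assumes the dump is
jstack-style text in which every thread has a "java.lang.Thread.State:" line;
a dump copied from a web page may be laid out differently. The function name
count_thread_states and the file name threaddump are only illustrative.

```shell
#!/bin/bash
# Sketch: tally thread states in a jstack-style thread dump.
# Assumption: each thread has a line such as
#   "   java.lang.Thread.State: WAITING (parking)"
count_thread_states() {
    # $2 is the state name that follows "java.lang.Thread.State:"
    awk '/java\.lang\.Thread\.State:/ { count[$2]++ }
         END { for (s in count) print s, count[s] }' "$1" \
        | sort -k2,2nr -k1,1   # highest count first, then by state name
}

# Example (hypothetical file name):
#   count_thread_states threaddump
```

A large WAITING count is not necessarily a problem by itself, since idle pool
threads normally sit in that state, so the tally is only a starting point.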


Best Regards

2016-03-18 14:35 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:

> Hi
>   To Patrick: Never mind. Thank you for your suggestion all the same.
>   To Otis: We do not use SPM. We monitor the JVM using only jstat because
> my system ran well before, so we did not need other tools.
> But SPM is really awesome.
>
>   Still looking for help...
>
> Best Regards
>
> 2016-03-18 6:01 GMT+08:00 Patrick Plaatje <pplaa...@gmail.com>:
>
>> Yeah, I didn't pay attention to the cached memory at all, my bad!
>>
>> I remember running into a similar situation a couple of years ago. One of
>> the things we did to investigate our memory profile was to produce a full
>> heap dump and manually analyse it using a tool like MAT.
>>
>> Cheers,
>> -patrick
>>
>>
>>
>>
>> On 17/03/2016, 21:58, "Otis Gospodnetić" <otis.gospodne...@gmail.com>
>> wrote:
>>
>> >Hi,
>> >
>> >On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje <pplaa...@gmail.com>
>> >wrote:
>> >
>> >> Hi,
>> >>
>> >> From the sar output you supplied, it looks like you might have a memory
>> >> issue on your hosts. The memory usage just before your crash seems to be
>> >> *very* close to 100%. Even the slightest increase (from Solr itself, or
>> >> possibly from a system service) could have caused the system to crash.
>> >> What are the specifications of your hosts and how much memory are you
>> >> allocating?
>> >
>> >
>> >That's normal actually - http://www.linuxatemyram.com/
>> >
>> >You *want* Linux to be using all your memory - you paid for it :)
>> >
>> >Otis
>> >--
>> >Monitoring - Log Management - Alerting - Anomaly Detection
>> >Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >
>> >
>> >
>> >
>> >>
>> >
>> >
>> >>
>> >>
>> >> On 16/03/2016, 14:52, "YouPeng Yang" <yypvsxf19870...@gmail.com> wrote:
>> >>
>> >> >Hi
>> >> > It happened again, and what is worse, my system crashed. We could not
>> >> >even connect to it with ssh.
>> >> > I used the sar command to capture statistics about it. Here are my
>> >> >details:
>> >> >
>> >> >
>> >> >[1] cpu (using sar -u). We had to restart our system, as shown by the
>> >> >red LINUX RESTART line in the logs.
>> >> >--------------------------------------------------------------------------------------------------
>> >> >03:00:01 PM     all      7.61      0.00      0.92      0.07      0.00     91.40
>> >> >03:10:01 PM     all      7.71      0.00      1.29      0.06      0.00     90.94
>> >> >03:20:01 PM     all      7.62      0.00      1.98      0.06      0.00     90.34
>> >> >03:30:35 PM     all      5.65      0.00     31.08      0.04      0.00     63.23
>> >> >03:42:40 PM     all     47.58      0.00     52.25      0.00      0.00      0.16
>> >> >Average:        all      8.21      0.00      1.57      0.05      0.00     90.17
>> >> >
>> >> >04:42:04 PM       LINUX RESTART
>> >> >
>> >> >04:50:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
>> >> >05:00:01 PM     all      3.49      0.00      0.62      0.15      0.00     95.75
>> >> >05:10:01 PM     all      9.03      0.00      0.92      0.28      0.00     89.77
>> >> >05:20:01 PM     all      7.06      0.00      0.78      0.05      0.00     92.11
>> >> >05:30:01 PM     all      6.67      0.00      0.79      0.06      0.00     92.48
>> >> >05:40:01 PM     all      6.26      0.00      0.76      0.05      0.00     92.93
>> >> >05:50:01 PM     all      5.49      0.00      0.71      0.05      0.00     93.75
>> >> >--------------------------------------------------------------------------------------------------
>> >> >
>> >> >[2] mem (using sar -r)
>> >> >--------------------------------------------------------------------------------------------------
>> >> >03:00:01 PM   1519272 196633272     99.23    361112  76364340 143574212     47.77
>> >> >03:10:01 PM   1451764 196700780     99.27    361196  76336340 143581608     47.77
>> >> >03:20:01 PM   1453400 196699144     99.27    361448  76248584 143551128     47.76
>> >> >03:30:35 PM   1513844 196638700     99.24    361648  76022016 143828244     47.85
>> >> >03:42:40 PM   1481108 196671436     99.25    361676  75718320 144478784     48.07
>> >> >Average:      5051607 193100937     97.45    362421  81775777 142758861     47.50
>> >> >
>> >> >04:42:04 PM       LINUX RESTART
>> >> >
>> >> >04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
>> >> >05:00:01 PM 154357132  43795412     22.10     92012  18648644 134950460     44.90
>> >> >05:10:01 PM 136468244  61684300     31.13    219572  31709216 134966548     44.91
>> >> >05:20:01 PM 135092452  63060092     31.82    221488  32162324 134949788     44.90
>> >> >05:30:01 PM 133410464  64742080     32.67    233848  32793848 134976828     44.91
>> >> >05:40:01 PM 132022052  66130492     33.37    235812  33278908 135007268     44.92
>> >> >05:50:01 PM 130630408  67522136     34.08    237140  33900912 135099764     44.95
>> >> >Average:    136996792  61155752     30.86    206645  30415642 134991776     44.91
>> >> >--------------------------------------------------------------------------------------------------
>> >> >
>> >> >
>> >> >As the blue parts show, my machine started failing at 03:30:35. It
>> >> >hung until I restarted it manually at 04:42:04.
>> >> >All of the above only captures the performance at the time of the
>> >> >crash; none of it reveals the cause. I have also checked
>> >> >/var/log/messages and found nothing useful.
>> >> >
>> >> >Note that I also ran sar -v. It shows something abnormal:
>> >>
>> >>
>> >------------------------------------------------------------------------------------------------
>> >> >02:50:01 PM  11542262      9216     76446       258
>> >> >03:00:01 PM  11645526      9536     76421       258
>> >> >03:10:01 PM  11748690      9216     76451       258
>> >> >03:20:01 PM  11850191      9152     76331       258
>> >> >03:30:35 PM  11972313     10112    132625       258
>> >> >03:42:40 PM  12177319     13760    340227       258
>> >> >Average:      8293601      8950     68187       161
>> >> >
>> >> >04:42:04 PM       LINUX RESTART
>> >> >
>> >> >04:50:01 PM dentunusd   file-nr  inode-nr    pty-nr
>> >> >05:00:01 PM     35410      7616     35223         4
>> >> >05:10:01 PM    137320      7296     42632         6
>> >> >05:20:01 PM    247010      7296     42839         9
>> >> >05:30:01 PM    358434      7360     42697         9
>> >> >05:40:01 PM    471543      7040     42929        10
>> >> >05:50:01 PM    583787      7296     42837        13
>> >>
>> >>
>> >------------------------------------------------------------------------------------------------
>> >> >
>> >> >I also checked the man page for the -v option:
>> >>
>> >>
>> >------------------------------------------------------------------------------------------------
>> >> >*-v*  Report status of inode, file and other kernel tables.  The
>> >> >following values are displayed:
>> >> >       *dentunusd*
>> >> >Number of unused cache entries in the directory cache.
>> >> >*file-nr*
>> >> >Number of file handles used by the system.
>> >> >*inode-nr*
>> >> >Number of inode handlers used by the system.
>> >> >*pty-nr*
>> >> >Number of pseudo-terminals used by the system.
>> >>
>> >>
>> >------------------------------------------------------------------------------------------------
>> >> >
>> >> >Is there any clue here about the crash? Would you please give me some
>> >> >suggestions?
>> >> >
>> >> >
>> >> >Best Regards.
>> >> >
>> >> >
>> >> >2016-03-16 14:01 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
>> >> >
>> >> >> Hello
>> >> >>    The problem has appeared several times; however, I could not capture
>> >> >> the top output. My script is shown below.
>> >> >> It checks whether the sys CPU usage exceeds 30%. The other metrics are
>> >> >> dumped successfully, but the top output is not.
>> >> >> Would you please check my script? I am not able to figure out what is
>> >> >> wrong.
>> >> >>
>> >> >>
>> >> >>
>> >>
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> >> #!/bin/bash
>> >> >>
>> >> >> # Poll %sys every few seconds and dump diagnostics once it reaches 30%.
>> >> >> while :
>> >> >>   do
>> >> >>     # Print 1 while %sys (field 6 of the mpstat sample line) is below 30, else 0
>> >> >>     sysusage=$(mpstat 2 1 | grep -A 1 "%sys" | tail -n 1 | awk '{if($6 < 30) print 1; else print 0;}' )
>> >> >>
>> >> >>     if [ "$sysusage" -eq 0 ];then
>> >> >>         #echo $sysusage
>> >> >>         #perf record -o perf$(date +%Y%m%d%H%M%S).data  -a -g -F 1000 sleep 30
>> >> >>         file=$(date +%Y%m%d%H%M%S)
>> >> >>         # -b (batch mode) lets top write to a file when there is no terminal
>> >> >>         top -b -n 2 >> top$file.data
>> >> >>         iotop -b -n 2  >> iotop$file.data
>> >> >>         iostat >> iostat$file.data
>> >> >>         netstat -an | awk '/^tcp/ {++state[$NF]} END {for(i in state) print i,"\t",state[i]}' >> netstat$file.data
>> >> >>     fi
>> >> >>     sleep 5
>> >> >>   done
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >>
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> >>
>> >> >> 2016-03-08 21:39 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
>> >> >>
>> >> >>> Hi all
>> >> >>>   Thanks for your replies. I have spent a lot of time investigating, and
>> >> >>> I will post some logs of 'top' and IO in a few days when the crash
>> >> >>> happens again.
>> >> >>>
>> >> >>> 2016-03-08 10:45 GMT+08:00 Shawn Heisey <apa...@elyograg.org>:
>> >> >>>
>> >> >>>> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
>> >> >>>> > How does this relate to YouPeng reporting that the CPU usage
>> >> >>>> > increases?
>> >> >>>> >
>> >> >>>> > This is not a snark. YouPeng mentions kernel issues. It might very
>> >> >>>> > well be that IO is the real problem, but that it manifests in a
>> >> >>>> > non-intuitive way. Before memory-mapping it was easy: Just look at
>> >> >>>> > IO-Wait. Now I am not so sure. Can high kernel load (Sy% in *nix top)
>> >> >>>> > indicate that the IO system is struggling, even if IO-Wait is low?
>> >> >>>>
>> >> >>>> It might turn out to be not directly related to memory, you're right
>> >> >>>> about that.  A very high query rate or particularly CPU-heavy queries
>> >> >>>> or analysis could cause high CPU usage even when memory is plentiful,
>> >> >>>> but in that situation I would expect high user percentage, not kernel.
>> >> >>>> I'm not completely sure what might cause high kernel usage if iowait
>> >> >>>> is low, but no specific information was given about iowait.  I've seen
>> >> >>>> iowait percentages of 10% or less with problems clearly caused by
>> >> >>>> iowait.
>> >> >>>>
>> >> >>>> With the available information (especially seeing 700GB of index
>> >> >>>> data), I believe that the "not enough memory" scenario is more likely
>> >> >>>> than anything else.  If the OP replies and says they have plenty of
>> >> >>>> memory, then we can move on to the less common (IMHO) reasons for high
>> >> >>>> CPU with a large index.
>> >> >>>>
>> >> >>>> If the OS is one that reports load average, I am curious what the 5
>> >> >>>> minute average is, and how many real (non-HT) CPU cores there are.
>> >> >>>>
>> >> >>>> Thanks,
>> >> >>>> Shawn
>> >> >>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >>
>> >>
>>
>>
>
