Hi
To Patrick: Never mind. Thank you for your suggestion all the same.
To Otis: We do not use SPM. We monitor the JVM with just jstat, because our system ran well before, so we did not need other tools. But SPM is really awesome.
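For reference, the JVM check we run is roughly the following jstat call (a minimal sketch; the PID and interval here are only placeholders, not our exact command):

    jstat -gcutil <solr_pid> 5000

It prints heap-occupancy and GC figures every 5 seconds, which has been enough for us so far.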
Still looking for help.....

Best Regards

2016-03-18 6:01 GMT+08:00 Patrick Plaatje <pplaa...@gmail.com>:

> Yeah, I didn't pay attention to the cached memory at all, my bad!
>
> I remember running into a similar situation a couple of years ago; one of the things we did to investigate our memory profile was to produce a full heap dump and manually analyse it with a tool like MAT.
>
> Cheers,
> -patrick
>
> On 17/03/2016, 21:58, "Otis Gospodnetić" <otis.gospodne...@gmail.com> wrote:
>
> >Hi,
> >
> >On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje <pplaa...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> From the sar output you supplied, it looks like you might have a memory issue on your hosts. The memory usage just before your crash seems to be *very* close to 100%. Even the slightest increase (from Solr itself, or possibly from a system service) could cause the system to crash. What are the specifications of your hosts and how much memory are you allocating?
> >
> >That's normal actually - http://www.linuxatemyram.com/
> >
> >You *want* Linux to be using all your memory - you paid for it :)
> >
> >Otis
> >--
> >Monitoring - Log Management - Alerting - Anomaly Detection
> >Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >> On 16/03/2016, 14:52, "YouPeng Yang" <yypvsxf19870...@gmail.com> wrote:
> >>
> >> >Hi
> >> >  It happened again, and this time it was worse: the system crashed and we could not even connect to it with ssh. I used sar to capture statistics about it. Here are the details:
> >> >
> >> >[1] CPU (sar -u). We had to restart the system, as marked by the LINUX RESTART line in the logs.
> >> >--------------------------------------------------------------------------------------------------
> >> >03:00:01 PM    all      7.61      0.00      0.92      0.07      0.00     91.40
> >> >03:10:01 PM    all      7.71      0.00      1.29      0.06      0.00     90.94
> >> >03:20:01 PM    all      7.62      0.00      1.98      0.06      0.00     90.34
> >> >03:30:35 PM    all      5.65      0.00     31.08      0.04      0.00     63.23
> >> >03:42:40 PM    all     47.58      0.00     52.25      0.00      0.00      0.16
> >> >Average:       all      8.21      0.00      1.57      0.05      0.00     90.17
> >> >
> >> >04:42:04 PM      LINUX RESTART
> >> >
> >> >04:50:01 PM    CPU     %user     %nice   %system   %iowait    %steal     %idle
> >> >05:00:01 PM    all      3.49      0.00      0.62      0.15      0.00     95.75
> >> >05:10:01 PM    all      9.03      0.00      0.92      0.28      0.00     89.77
> >> >05:20:01 PM    all      7.06      0.00      0.78      0.05      0.00     92.11
> >> >05:30:01 PM    all      6.67      0.00      0.79      0.06      0.00     92.48
> >> >05:40:01 PM    all      6.26      0.00      0.76      0.05      0.00     92.93
> >> >05:50:01 PM    all      5.49      0.00      0.71      0.05      0.00     93.75
> >> >--------------------------------------------------------------------------------------------------
> >> >
> >> >[2] Memory (sar -r)
> >> >--------------------------------------------------------------------------------------------------
> >> >03:00:01 PM   1519272 196633272     99.23    361112  76364340 143574212     47.77
> >> >03:10:01 PM   1451764 196700780     99.27    361196  76336340 143581608     47.77
> >> >03:20:01 PM   1453400 196699144     99.27    361448  76248584 143551128     47.76
> >> >03:30:35 PM   1513844 196638700     99.24    361648  76022016 143828244     47.85
> >> >03:42:40 PM   1481108 196671436     99.25    361676  75718320 144478784     48.07
> >> >Average:      5051607 193100937     97.45    362421  81775777 142758861     47.50
> >> >
> >> >04:42:04 PM      LINUX RESTART
> >> >
> >> >04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
> >> >05:00:01 PM 154357132  43795412     22.10     92012  18648644 134950460     44.90
> >> >05:10:01 PM 136468244  61684300     31.13    219572  31709216 134966548     44.91
> >> >05:20:01 PM 135092452  63060092     31.82    221488  32162324 134949788     44.90
> >> >05:30:01 PM 133410464  64742080     32.67    233848  32793848 134976828     44.91
> >> >05:40:01 PM 132022052  66130492     33.37    235812  33278908 135007268     44.92
> >> >05:50:01 PM 130630408  67522136     34.08    237140  33900912 135099764     44.95
> >> >Average:    136996792  61155752     30.86    206645  30415642 134991776     44.91
> >> >--------------------------------------------------------------------------------------------------
> >> >
> >> >As the 03:30:35 and 03:42:40 rows show, the host got into trouble at 03:30:35 and hung until I restarted it manually at 04:42:04. All of the above is only a snapshot of the performance at the time of the crash; nothing in it explains the reason. I have also checked /var/log/messages and found nothing useful.
> >> >
> >> >Note that sar -v shows something abnormal:
> >> >------------------------------------------------------------------------------------------------
> >> >02:50:01 PM  11542262      9216     76446       258
> >> >03:00:01 PM  11645526      9536     76421       258
> >> >03:10:01 PM  11748690      9216     76451       258
> >> >03:20:01 PM  11850191      9152     76331       258
> >> >03:30:35 PM  11972313     10112    132625       258
> >> >03:42:40 PM  12177319     13760    340227       258
> >> >Average:      8293601      8950     68187       161
> >> >
> >> >04:42:04 PM      LINUX RESTART
> >> >
> >> >04:50:01 PM dentunusd   file-nr  inode-nr    pty-nr
> >> >05:00:01 PM     35410      7616     35223         4
> >> >05:10:01 PM    137320      7296     42632         6
> >> >05:20:01 PM    247010      7296     42839         9
> >> >05:30:01 PM    358434      7360     42697         9
> >> >05:40:01 PM    471543      7040     42929        10
> >> >05:50:01 PM    583787      7296     42837        13
> >> >------------------------------------------------------------------------------------------------
> >> >
> >> >The man page describes the -v option as follows:
> >> >------------------------------------------------------------------------------------------------
> >> >*-v* Report status of inode, file and other kernel tables. The following values are displayed:
> >> >*dentunusd*  Number of unused cache entries in the directory cache.
> >> >*file-nr*    Number of file handles used by the system.
> >> >*inode-nr*   Number of inode handlers used by the system.
> >> >*pty-nr*     Number of pseudo-terminals used by the system.
> >> >------------------------------------------------------------------------------------------------
> >> >
> >> >Is there any clue about the crash? Would you please give me some suggestions?
> >> >
> >> >Best Regards.
> >> >
> >> >2016-03-16 14:01 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
> >> >
> >> >> Hello
> >> >>   The problem has appeared several times, but I could not capture the top output. My script is below: it checks whether the sys CPU usage exceeds 30%, and the other metrics are dumped successfully, except for top. Would you please check my script? I am not able to figure out what is wrong.
> >> >>
> >> >> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> >> #!/bin/bash
> >> >>
> >> >> while :
> >> >> do
> >> >>     sysusage=$(mpstat 2 1 | grep -A 1 "%sys" | tail -n 1 | awk '{if($6 < 30) print 1; else print 0;}')
> >> >>
> >> >>     if [ $sysusage -eq 0 ];then
> >> >>         #echo $sysusage
> >> >>         #perf record -o perf$(date +%Y%m%d%H%M%S).data -a -g -F 1000
> >> >>         sleep 30
> >> >>         file=$(date +%Y%m%d%H%M%S)
> >> >>         top -n 2 >> top$file.data
> >> >>         iotop -b -n 2 >> iotop$file.data
> >> >>         iostat >> iostat$file.data
> >> >>         netstat -an | awk '/^tcp/ {++state[$NF]} END {for(i in state) print i,"\t",state[i]}' >> netstat$file.data
> >> >>     fi
> >> >>     sleep 5
> >> >> done
> >> >> You have new mail in /var/spool/mail/root
> >> >> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> >>
> >> >> 2016-03-08 21:39 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
> >> >>
> >> >>> Hi all
> >> >>>   Thanks for your reply. I have been investigating this for quite a while, and I will post some top and IO logs in a few days when the crash comes again.
> >> >>>
> >> >>> 2016-03-08 10:45 GMT+08:00 Shawn Heisey <apa...@elyograg.org>:
> >> >>>
> >> >>>> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
> >> >>>> > How does this relate to YouPeng reporting that the CPU usage increases?
> >> >>>> >
> >> >>>> > This is not a snark. YouPeng mentions kernel issues. It might very well be that IO is the real problem, but that it manifests in a non-intuitive way. Before memory-mapping it was easy: just look at IO-Wait. Now I am not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO system is struggling, even if IO-Wait is low?
> >> >>>>
> >> >>>> It might turn out not to be directly related to memory, you're right about that. A very high query rate or particularly CPU-heavy queries or analysis could cause high CPU usage even when memory is plentiful, but in that situation I would expect a high user percentage, not kernel. I'm not completely sure what might cause high kernel usage if iowait is low, but no specific information was given about iowait. I've seen iowait percentages of 10% or less with problems clearly caused by iowait.
> >> >>>>
> >> >>>> With the available information (especially seeing 700GB of index data), I believe that the "not enough memory" scenario is more likely than anything else. If the OP replies and says they have plenty of memory, then we can move on to the less common (IMHO) reasons for high CPU with a large index.
> >> >>>>
> >> >>>> If the OS is one that reports load average, I am curious what the 5-minute average is, and how many real (non-HT) CPU cores there are.
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Shawn
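P.S. One guess about why top is the only thing missing from my capture script: top usually needs batch mode when its output is redirected to a file rather than a terminal. If that is the cause, the fix would be a change along these lines (a sketch only, not yet verified on our hosts):

    top -b -n 2 >> top$file.data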