Hi,

Our system still goes down as time goes on. We found that many threads are in the WAITING state. Here is the thread dump that I copied from the web page, along with 4 screenshots of it. Could this be related to my problem?
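A quick way to summarise a dump like this one is to count the threads per state. This is a minimal sketch, assuming the dump is in HotSpot `jstack` text format with `java.lang.Thread.State:` lines (a dump copied from the admin web page may be laid out differently). The function name `count_thread_states` is made up for this example.

```shell
#!/bin/bash
# Count threads per state in a jstack-style dump file.
# Assumes lines of the form "   java.lang.Thread.State: WAITING (parking)".
# Prints "STATE count" pairs, sorted by state name.
count_thread_states() {
    awk '/java.lang.Thread.State:/ { count[$2]++ }
         END { for (s in count) print s, count[s] }' "$1" | sort
}
```

Usage would be `count_thread_states threaddump`. Many WAITING threads are often just idle pool workers parked on a queue, so the count alone does not prove a problem; the stack frames under each WAITING thread say more.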
https://www.dropbox.com/s/h3wyez091oouwck/threaddump?dl=0
https://www.dropbox.com/s/p3ctuxb3t1jgo2e/threaddump1.jpg?dl=0
https://www.dropbox.com/s/w0uy15h6z984ntw/threaddump2.jpg?dl=0
https://www.dropbox.com/s/0frskxdllxlz9ha/threaddump3.jpg?dl=0
https://www.dropbox.com/s/46ptnly1ngi9nb6/threaddump4.jpg?dl=0

Best Regards

2016-03-18 14:35 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:

> Hi
> To Patrick: Never mind. Thank you for your suggestion all the same.
> To Otis: We do not use SPM. We monitor the JVM with just jstat, because
> my system ran well before, so we did not need other tools.
> But SPM is really awesome.
>
> Still looking for help.....
>
> Best Regards
>
> 2016-03-18 6:01 GMT+08:00 Patrick Plaatje <pplaa...@gmail.com>:
>
>> Yeah, I didn't pay attention to the cached memory at all, my bad!
>>
>> I remember running into a similar situation a couple of years ago, one of
>> the things to investigate our memory profile was to produce a full heap
>> dump and manually analyse that using a tool like MAT.
>>
>> Cheers,
>> -patrick
>>
>> On 17/03/2016, 21:58, "Otis Gospodnetić" <otis.gospodne...@gmail.com>
>> wrote:
>>
>> >Hi,
>> >
>> >On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje <pplaa...@gmail.com>
>> >wrote:
>> >
>> >> Hi,
>> >>
>> >> From the sar output you supplied, it looks like you might have a memory
>> >> issue on your hosts. The memory usage just before your crash seems to be
>> >> *very* close to 100%. Even the slightest increase (Solr itself, or
>> >> possibly by a system service) could cause the system to crash. What are
>> >> the specifications of your hosts and how much memory are you allocating?
>> >
>> >That's normal actually - http://www.linuxatemyram.com/
>> >
>> >You *want* Linux to be using all your memory - you paid for it :)
>> >
>> >Otis
>> >--
>> >Monitoring - Log Management - Alerting - Anomaly Detection
>> >Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >
>> >>
>> >> On 16/03/2016, 14:52, "YouPeng Yang" <yypvsxf19870...@gmail.com> wrote:
>> >>
>> >> >Hi
>> >> > It happened again, and what is worse, my system crashed. We could
>> >> >not even connect to it with ssh.
>> >> > I used the sar command to capture statistics about it. Here
>> >> >are the details:
>> >> >
>> >> >[1] CPU (using sar -u). We had to restart the system, as marked by the
>> >> >red-font LINUX RESTART in the logs.
>> >> >----------------------------------------------------------------------
>> >> >03:00:01 PM     all      7.61      0.00      0.92      0.07      0.00     91.40
>> >> >03:10:01 PM     all      7.71      0.00      1.29      0.06      0.00     90.94
>> >> >03:20:01 PM     all      7.62      0.00      1.98      0.06      0.00     90.34
>> >> >03:30:35 PM     all      5.65      0.00     31.08      0.04      0.00     63.23
>> >> >03:42:40 PM     all     47.58      0.00     52.25      0.00      0.00      0.16
>> >> >Average:        all      8.21      0.00      1.57      0.05      0.00     90.17
>> >> >
>> >> >04:42:04 PM       LINUX RESTART
>> >> >
>> >> >04:50:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
>> >> >05:00:01 PM     all      3.49      0.00      0.62      0.15      0.00     95.75
>> >> >05:10:01 PM     all      9.03      0.00      0.92      0.28      0.00     89.77
>> >> >05:20:01 PM     all      7.06      0.00      0.78      0.05      0.00     92.11
>> >> >05:30:01 PM     all      6.67      0.00      0.79      0.06      0.00     92.48
>> >> >05:40:01 PM     all      6.26      0.00      0.76      0.05      0.00     92.93
>> >> >05:50:01 PM     all      5.49      0.00      0.71      0.05      0.00     93.75
>> >> >----------------------------------------------------------------------
>> >> >
>> >> >[2] Memory (using sar -r)
>> >> >----------------------------------------------------------------------
>> >> >03:00:01 PM   1519272 196633272     99.23    361112  76364340 143574212     47.77
>> >> >03:10:01 PM   1451764 196700780     99.27    361196  76336340 143581608     47.77
>> >> >03:20:01 PM   1453400 196699144     99.27    361448  76248584 143551128     47.76
>> >> >03:30:35 PM   1513844 196638700     99.24    361648  76022016 143828244     47.85
>> >> >03:42:40 PM   1481108 196671436     99.25    361676  75718320 144478784     48.07
>> >> >Average:      5051607 193100937     97.45    362421  81775777 142758861     47.50
>> >> >
>> >> >04:42:04 PM       LINUX RESTART
>> >> >
>> >> >04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
>> >> >05:00:01 PM 154357132  43795412     22.10     92012  18648644 134950460     44.90
>> >> >05:10:01 PM 136468244  61684300     31.13    219572  31709216 134966548     44.91
>> >> >05:20:01 PM 135092452  63060092     31.82    221488  32162324 134949788     44.90
>> >> >05:30:01 PM 133410464  64742080     32.67    233848  32793848 134976828     44.91
>> >> >05:40:01 PM 132022052  66130492     33.37    235812  33278908 135007268     44.92
>> >> >05:50:01 PM 130630408  67522136     34.08    237140  33900912 135099764     44.95
>> >> >Average:    136996792  61155752     30.86    206645  30415642 134991776     44.91
>> >> >----------------------------------------------------------------------
>> >> >
>> >> >As the parts in blue font show, my machine started to crash at 03:30:35.
>> >> >It hung until I restarted it manually at 04:42:04.
>> >> >All the above information is just a snapshot of the performance when it
>> >> >crashed; nothing in it explains the reason. I have also
>> >> >checked /var/log/messages and found nothing useful.
>> >> >
>> >> >Note that I also ran sar -v. It shows something abnormal:
>> >> >----------------------------------------------------------------------
>> >> >02:50:01 PM  11542262      9216     76446       258
>> >> >03:00:01 PM  11645526      9536     76421       258
>> >> >03:10:01 PM  11748690      9216     76451       258
>> >> >03:20:01 PM  11850191      9152     76331       258
>> >> >03:30:35 PM  11972313     10112    132625       258
>> >> >03:42:40 PM  12177319     13760    340227       258
>> >> >Average:      8293601      8950     68187       161
>> >> >
>> >> >04:42:04 PM       LINUX RESTART
>> >> >
>> >> >04:50:01 PM dentunusd   file-nr  inode-nr    pty-nr
>> >> >05:00:01 PM     35410      7616     35223         4
>> >> >05:10:01 PM    137320      7296     42632         6
>> >> >05:20:01 PM    247010      7296     42839         9
>> >> >05:30:01 PM    358434      7360     42697         9
>> >> >05:40:01 PM    471543      7040     42929        10
>> >> >05:50:01 PM    583787      7296     42837        13
>> >> >----------------------------------------------------------------------
>> >> >
>> >> >I checked the man page for the -v option:
>> >> >----------------------------------------------------------------------
>> >> >*-v* Report status of inode, file and other kernel tables. The following
>> >> >values are displayed:
>> >> >*dentunusd*
>> >> >Number of unused cache entries in the directory cache.
>> >> >*file-nr*
>> >> >Number of file handles used by the system.
>> >> >*inode-nr*
>> >> >Number of inode handlers used by the system.
>> >> >*pty-nr*
>> >> >Number of pseudo-terminals used by the system.
>> >> >----------------------------------------------------------------------
>> >> >
>> >> >Is there any clue about the crash? Would you please give me some
>> >> >suggestions?
>> >> >
>> >> >
>> >> >Best Regards.
>> >> >
>> >> >
>> >> >2016-03-16 14:01 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
>> >> >
>> >> >> Hello
>> >> >> The problem has appeared several times; however, I could not capture
>> >> >> the top output. My script is below.
>> >> >> It checks whether the sys CPU usage exceeds 30%. The other metrics
>> >> >> are dumped successfully, all except top.
>> >> >> Would you please check my script? I am not able to figure out what
>> >> >> is wrong.
>> >> >>
>> >> >> ----------------------------------------------------------------------
>> >> >> #!/bin/bash
>> >> >>
>> >> >> while :
>> >> >> do
>> >> >>     sysusage=$(mpstat 2 1 | grep -A 1 "%sys" | tail -n 1 | awk '{if($6 < 30) print 1; else print 0;}' )
>> >> >>
>> >> >>     if [ $sysusage -eq 0 ];then
>> >> >>         #echo $sysusage
>> >> >>         #perf record -o perf$(date +%Y%m%d%H%M%S).data -a -g -F 1000 sleep 30
>> >> >>         file=$(date +%Y%m%d%H%M%S)
>> >> >>         top -n 2 >> top$file.data
>> >> >>         iotop -b -n 2 >> iotop$file.data
>> >> >>         iostat >> iostat$file.data
>> >> >>         netstat -an | awk '/^tcp/ {++state[$NF]} END {for(i in state) print i,"\t",state[i]}' >> netstat$file.data
>> >> >>     fi
>> >> >>     sleep 5
>> >> >> done
>> >> >> ----------------------------------------------------------------------
>> >> >>
>> >> >> 2016-03-08 21:39 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
>> >> >>
>> >> >>> Hi all
>> >> >>> Thanks for your replies. I have spent a lot of time investigating,
>> >> >>> and I will post some logs of 'top' and IO in a few days when the
>> >> >>> crash comes again.
>> >> >>>
>> >> >>> 2016-03-08 10:45 GMT+08:00 Shawn Heisey <apa...@elyograg.org>:
>> >> >>>
>> >> >>>> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
>> >> >>>> > How does this relate to YouPeng reporting that the CPU usage
>> >> >>>> > increases?
>> >> >>>> >
>> >> >>>> > This is not a snark. YouPeng mentions kernel issues. It might very
>> >> >>>> > well be that IO is the real problem, but that it manifests in a
>> >> >>>> > non-intuitive way. Before memory-mapping it was easy: Just look at
>> >> >>>> > IO-Wait. Now I am not so sure. Can high kernel load (Sy% in *nix
>> >> >>>> > top) indicate that the IO system is struggling, even if IO-Wait is
>> >> >>>> > low?
>> >> >>>>
>> >> >>>> It might turn out to be not directly related to memory, you're right
>> >> >>>> about that. A very high query rate or particularly CPU-heavy queries
>> >> >>>> or analysis could cause high CPU usage even when memory is plentiful,
>> >> >>>> but in that situation I would expect high user percentage, not
>> >> >>>> kernel. I'm not completely sure what might cause high kernel usage
>> >> >>>> if iowait is low, but no specific information was given about
>> >> >>>> iowait. I've seen iowait percentages of 10% or less with problems
>> >> >>>> clearly caused by iowait.
>> >> >>>>
>> >> >>>> With the available information (especially seeing 700GB of index
>> >> >>>> data), I believe that the "not enough memory" scenario is more
>> >> >>>> likely than anything else. If the OP replies and says they have
>> >> >>>> plenty of memory, then we can move on to the less common (IMHO)
>> >> >>>> reasons for high CPU with a large index.
>> >> >>>>
>> >> >>>> If the OS is one that reports load average, I am curious what the 5
>> >> >>>> minute average is, and how many real (non-HT) CPU cores there are.
>> >> >>>>
>> >> >>>> Thanks,
>> >> >>>> Shawn
>> >> >>>>
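
On the capture script quoted above: one likely reason the `top` output is missing is that `top` is started without `-b` (batch mode), which it generally needs when its output is redirected to a file or when it runs without a terminal. `iotop` in the same script already uses `-b`. Below is a minimal sketch of the same check, assuming sysstat's `mpstat` prints a `%sys` (or `%system`) column in its header. The function names and the way the 30% threshold is passed are illustrative, not taken from the thread.

```shell
#!/bin/bash
# Print the system-CPU percentage from `mpstat 2 1`-style output on stdin.
# The column is found by header name rather than a fixed field number,
# because its position shifts with the time format (AM/PM adds a field).
sys_cpu_percent() {
    awk '
        /%sys/ { for (i = 1; i <= NF; i++) if ($i ~ /^%sys/) col = i; next }
        col && $0 !~ /^Average/ && NF >= col { print $col; exit }
    '
}

# Succeed when the value ($1) is at or above the threshold ($2).
above_threshold() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# Write one snapshot of each tool into timestamped files.
capture_snapshot() {
    local stamp
    stamp=$(date +%Y%m%d%H%M%S)
    top -b -n 2 >> "top$stamp.data"      # -b: batch mode, safe to redirect
    iotop -b -n 2 >> "iotop$stamp.data"
    iostat >> "iostat$stamp.data"
    netstat -an | awk '/^tcp/ {++state[$NF]} END {for (i in state) print i, "\t", state[i]}' >> "netstat$stamp.data"
}

# Poll forever, sleeping 5 seconds between checks; call monitor_loop to start.
monitor_loop() {
    local sys
    while :; do
        sys=$(mpstat 2 1 | sys_cpu_percent)
        if [ -n "$sys" ] && above_threshold "$sys" 30; then
            capture_snapshot
        fi
        sleep 5
    done
}
```

Nothing runs until `monitor_loop` is called, so the parsing functions can be tried on saved `mpstat` output first.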