OK, so here is the latest: http://imagebin.org/240392

I took a jmap histogram on startup and another an hour later: http://pastebin.com/xEkWid4f
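(For reference, these were taken with something like:

    jmap -histo <namenode-pid> > nn.histo

where <namenode-pid> is whatever jps reports for the NameNode. The :live variant Suresh suggests below, jmap -histo:live, counts only reachable objects but forces a full GC first, per his warning.)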
I think the biggest deal is [B (plain byte arrays), which may not be very helpful on its own:

On startup:

 num   #instances       #bytes  class name
----------------------------------------------
   1:    25094067   2319943656  [B
   2:    23720125   1518088000  org.apache.hadoop.hdfs.server.namenode.INodeFile
   3:    24460244   1174091712  org.apache.hadoop.hdfs.server.namenode.BlocksMap$BlockInfo
   4:    25671649   1134707328  [Ljava.lang.Object;
   5:    31106937    995421984  java.util.HashMap$Entry
   6:    23725233    570829968  [Lorg.apache.hadoop.hdfs.server.namenode.BlocksMap$BlockInfo;
   7:        2934    322685952  [Ljava.util.HashMap$Entry;

An hour later:

 num   #instances       #bytes  class name
----------------------------------------------
   1:    24739690   3727511000  [B
   2:    23280668   1489962752  org.apache.hadoop.hdfs.server.namenode.INodeFile
   3:    24850044   1192802112  org.apache.hadoop.hdfs.server.namenode.BlocksMap$BlockInfo
   4:    26124258   1157073272  [Ljava.lang.Object;
   5:    32142057   1028545824  java.util.HashMap$Entry
   6:    23307473    560625432  [Lorg.apache.hadoop.hdfs.server.namenode.BlocksMap$BlockInfo;

GC starts like this:

3.204: [GC 102656K->9625K(372032K), 0.0150300 secs]
3.519: [GC 112281K->21180K(372032K), 0.0741210 secs]
3.883: [GC 123836K->30729K(372032K), 0.0208900 secs]
4.194: [GC 132724K->45785K(372032K), 0.0293860 secs]
4.522: [GC 148441K->59282K(372032K), 0.0341330 secs]
4.844: [GC 161938K->70071K(372032K), 0.0284850 secs]
5.139: [GC 172727K->80624K(372032K), 0.0171910 secs]
5.338: [GC 183280K->90661K(372032K), 0.0184200 secs]
5.549: [GC 193317K->103126K(372032K), 0.0430080 secs]
5.775: [GC 205782K->113534K(372032K), 0.0359480 secs]
5.995: [GC 216190K->122832K(372032K), 0.0192900 secs]
6.192: [GC 225488K->131777K(372032K), 0.0183870 secs]

Then it steadily increases:

453.808: [GC 7482139K->7384396K(11240624K), 0.0208170 secs]
455.605: [GC 7487052K->7384177K(11240624K), 0.0206360 secs]
457.942: [GC 7486831K->7384131K(11240624K), 0.0189600 secs]
459.924: [GC 7486787K->7384141K(11240624K), 0.0193560 secs]
462.887: [GC 7486797K->7384151K(11240624K), 0.0189290 secs]

...until I triggered this full GC a moment ago:

6266.988: [GC 11255823K->10373641K(17194656K), 0.0331910 secs]
6280.083: [GC 11259721K->10373499K(17194656K), 0.0324870 secs]
6293.706: [GC 11259579K->10376656K(17194656K), 0.0324330 secs]
6309.781: [GC 11262736K->10376110K(17194656K), 0.0310330 secs]
6333.790: [GC 11262190K->10374348K(17194656K), 0.0297670 secs]
6333.934: [Full GC 10391746K->9722532K(17194656K), 63.9812940 secs]
6418.466: [GC 10608612K->9725743K(17201024K), 0.0339610 secs]
6421.420: [GC 10611823K->9760611K(17201024K), 0.1501610 secs]
6428.221: [GC 10646691K->9767236K(17201024K), 0.1503170 secs]
6437.431: [GC 10653316K->9734750K(17201024K), 0.0344960 secs]

Essentially, GC sometimes clears some memory, but not all, and the post-GC line keeps rising. The delta is about 10-17 hours until the heap is exhausted.
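(If anyone wants to chart that trend themselves: each minor-GC line above is "timestamp: [GC beforeK->afterK(totalK), pause]", so the post-collection occupancy can be pulled out with a rough one-liner, assuming the exact log format shown and with namenode-gc.log standing in for wherever your gc log lands:

    sed -n 's/^\([0-9.]*\): \[GC \([0-9]*\)K->\([0-9]*\)K.*/\1 \3/p' namenode-gc.log > occupancy.txt

That emits "seconds-since-start post-GC-KB" pairs, skipping the Full GC lines; the steadily rising second column is the line in the graph above.)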
On Sat, Dec 22, 2012 at 7:03 PM, Edward Capriolo <[email protected]> wrote:

> Blocks is ~26,000,000. Files is a bit higher, ~27,000,000.
>
> Currently running:
>
> [root@hnn217 ~]# java -version
> java version "1.7.0_09"
>
> Was running 1.6.0_23.
>
> export JVM_OPTIONS="-XX:+UseCompressedOops -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly"
>
> I will grab the gc logs and the heap dump in a follow-up.
>
> On Sat, Dec 22, 2012 at 1:32 PM, Suresh Srinivas <[email protected]> wrote:
>
>> Please take a histo:live dump when the memory is full. Note that this
>> causes a full gc.
>> http://docs.oracle.com/javase/6/docs/technotes/tools/share/jmap.html
>>
>> What is the number of blocks you have on the system?
>>
>> Send the JVM options you are using. From earlier java versions, which used
>> 1/8 of total heap for young gen, it has gone up to 1/3 of total heap. This
>> could also be the reason.
>>
>> Do you collect gc logs? Send that as well.
>>
>> Sent from a mobile device
>>
>> On Dec 22, 2012, at 9:51 AM, Edward Capriolo <[email protected]> wrote:
>>
>>> Newer 1.6 releases are getting close to 1.7, so I am not going to fear a
>>> number and fight the future.
>>>
>>> I have been at around 27 million files for a while and have been as high
>>> as 30 million; I do not think that is related.
>>>
>>> I do not think it is related to checkpoints, but I am considering
>>> raising/lowering the checkpoint triggers.
>>>
>>> On Saturday, December 22, 2012, Joep Rottinghuis <[email protected]> wrote:
>>>
>>>> Do your OOMs correlate with the secondary checkpointing?
>>>>
>>>> Joep
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Dec 22, 2012, at 7:42 AM, Michael Segel <[email protected]> wrote:
>>>>
>>>>> Hey, silly question...
>>>>>
>>>>> How long have you had 27 million files? I mean, can you correlate the
>>>>> number of files to the spate of OOMs?
>>>>>
>>>>> Even without problems, I'd say it would be a good idea to upgrade due
>>>>> to the probability of a lot of code fixes...
>>>>>
>>>>> If you're running anything pre-1.x, going to Java 1.7 wouldn't be a
>>>>> good idea. Having said that, outside of MapR, have any of the distros
>>>>> certified themselves on 1.7 yet?
>>>>>
>>>>> On Dec 22, 2012, at 6:54 AM, Edward Capriolo <[email protected]> wrote:
>>>>>
>>>>>> I will give this a go. I have actually gone into JMX and manually
>>>>>> triggered GC; no memory is returned. So I assumed something was
>>>>>> leaking.
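(Replying inline to my own note just above: on 1.7 that manual GC can also be triggered from the shell via jcmd, which ships with JDK 7, something like:

    jcmd <namenode-pid> GC.run

GC.run behaves like a System.gc() call, so it carries the same full-GC cost Suresh flags above for histo:live; the pid is again whatever jps reports.)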
>>>>>> On Fri, Dec 21, 2012 at 11:59 PM, Adam Faris <[email protected]> wrote:
>>>>>>
>>>>>>> I know this will sound odd, but try reducing your heap size. We had
>>>>>>> an issue like this where GC kept falling behind, and we either ran
>>>>>>> out of heap or would be in full gc. By reducing heap, we were forcing
>>>>>>> concurrent mark sweep to occur and avoided both full GC and running
>>>>>>> out of heap space, as the JVM would collect objects more frequently.
>>>>>>>
>>>>>>> On Dec 21, 2012, at 8:24 PM, Edward Capriolo <[email protected]> wrote:
>>>>>>>
>>>>>>>> I have an old hadoop 0.20.2 cluster. We have not had any issues for
>>>>>>>> a while (which is why I never bothered with an upgrade).
>>>>>>>>
>>>>>>>> Suddenly it OOMed last week. Now the OOMs happen periodically. We
>>>>>>>> have a fairly large NameNode heap, Xmx 17GB. It is a fairly large
>>>>>>>> FS, about 27,000,000 files.
>>>>>>>>
>>>>>>>> The strangest thing is that every hour and a half the NN memory
>>>>>>>> usage increases until the heap is full.
>>>>>>>>
>>>>>>>> http://imagebin.org/240287
>>>>>>>>
>>>>>>>> We tried failing over the NN to another machine. We changed the
>>>>>>>> Java version from 1.6.0_23 -> 1.7.0.
>>>>>>>>
>>>>>>>> I have set the NameNode logs to DEBUG and ALL, and I have done the
>>>>>>>> same with the data nodes. The Secondary NN is running, shipping
>>>>>>>> edits, and making new images.
>>>>>>>>
>>>>>>>> I am thinking something has corrupted the NN metadata and that
>>>>>>>> after enough time it becomes a time bomb, but this is just a total
>>>>>>>> shot in the dark. Does anyone have any interesting troubleshooting
>>>>>>>> ideas?
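P.S. On Suresh's young-gen point above: rather than trusting the new default, the young gen can be pinned explicitly, e.g. by adding something like

    -Xmn1g

(or -XX:NewRatio=7 for roughly the old 1/8-of-heap split) to the JVM_OPTIONS above. The 1g figure is purely illustrative; I have not measured what is right for a 17GB NN heap.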
