Hey Peter,

The following log indicates the issue:

2012-03-15 19:36:16,645 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as torb1pnb002.dataraker.net:35665 with total resource of memory: 1000

Your NM has only 1000 MB to serve as resources, because of your
yarn.nodemanager.resource.memory-mb configuration. Your job's AM and task
containers, however, each demand more than 1000 MB (1024 MB by default; see
MAPREDUCE-4026). Hence the resource requests can't be granted by this NM, and
the RM can't find another NM that can accept such high requests.

A quick fix is to simply raise yarn.nodemanager.resource.memory-mb back to
its default of 8192 (8 GB) or higher.

On Fri, Mar 16, 2012 at 1:22 AM, Peter Naudus <[email protected]> wrote:
> Hello all,
>
> When submitting an HBase export job to YARN, I see it appearing on the web
> UI but for some reason, the job never starts; it constantly stays at 0%
> complete. I am using Hadoop 0.23 and HBase 0.92 (CDH4 beta 1).
>
> I see the NodeManagers connecting to the ResourceManager:
>
> 2012-03-15 19:36:10,585 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
> torb1pnb001.dataraker.net:46696 Node Transitioned from NEW to RUNNING
> 2012-03-15 19:36:16,633 INFO org.apache.hadoop.yarn.util.RackResolver:
> Resolved torb1pnb002.dataraker.net to /default-rack
> 2012-03-15 19:36:16,633 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:
> NodeManager from node torb1pnb002.dataraker.net(cmPort: 35665 httpPort:
> 9999) registered with capability: 1000, assigned nodeId
> torb1pnb002.dataraker.net:35665
> 2012-03-15 19:36:16,634 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
> torb1pnb002.dataraker.net:35665 Node Transitioned from NEW to RUNNING
> [ etc...
> ]
>
> and the job being submitted to the ResourceManager:
>
> 2012-03-15 19:40:29,248 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new
> applicationId: 1
> 2012-03-15 19:40:31,323 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1331840162147_0001 State change from NEW to SUBMITTED
> 2012-03-15 19:40:31,323 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
> Registering appattempt_1331840162147_0001_000001
> 2012-03-15 19:40:31,323 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1331840162147_0001_000001 State change from NEW to SUBMITTED
> 2012-03-15 19:40:31,327 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application
> with id 1 submitted by user hdfs with application_id [..snip..]
> 2012-03-15 19:40:31,329 INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs
> IP=10.192.16.64 OPERATION=Submit Application Request
> TARGET=ClientRMService RESULT=SUCCESS
> APPID=application_1331840162147_0001
> 2012-03-15 19:40:31,333 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler:
> Application Submission: application_1331840162147_0001 from hdfs, currently
> active: 1
> 2012-03-15 19:40:31,336 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1331840162147_0001_000001 State change from SUBMITTED to
> SCHEDULED
> 2012-03-15 19:40:31,336 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1331840162147_0001 State change from SUBMITTED to ACCEPTED
>
> but after the NodeManager starts, the log never indicates any requests from
> the ResourceManager
>
> 2012-03-15 19:36:16,604 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connected
> to ResourceManager at torb1pna001:8025
> 2012-03-15 19:36:16,645 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered
> with ResourceManager as torb1pnb002.dataraker.net:35665 with total resource
> of memory: 1000
> 2012-03-15 19:36:16,645 INFO org.apache.hadoop.yarn.service.AbstractService:
> Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is
> started.
> 2012-03-15 19:36:16,646 INFO org.apache.hadoop.yarn.service.AbstractService:
> Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
> [ end of log ]
>
> I am seeing strange errors in Zookeeper when the job is submitted:
>
> 2012-03-15 16:58:00,216 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] -
> Accepted socket connection from /127.0.0.1:33262
> 2012-03-15 16:58:00,219 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client
> attempting to establish new session at /127.0.0.1:33262
> 2012-03-15 16:58:00,229 - INFO [CommitProcessor:0:ZooKeeperServer@604] -
> Established session 0x35d53d539f0071 with negotiated timeout 40000 for
> client /127.0.0.1:33262
> 2012-03-15 16:58:48,884 - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end
> of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x35d53d539f0071, likely client has closed socket
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-03-15 16:58:48,885 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed
> socket connection for client /127.0.0.1:33262 which had sessionid
> 0x35d53d539f0071
> 2012-03-15 17:02:59,968 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] -
> Accepted socket connection from /127.0.0.1:59652
> 2012-03-15 17:02:59,971 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client
> attempting to establish new session at /127.0.0.1:59652
> 2012-03-15 17:02:59,978 - INFO [CommitProcessor:0:ZooKeeperServer@604] -
> Established session 0x35d53d539f0072 with negotiated timeout 40000 for
> client /127.0.0.1:59652
> 2012-03-15 17:18:39,480 - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end
> of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x35d53d539f0072, likely client has closed socket
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-03-15 17:18:39,481 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed
> socket connection for client /127.0.0.1:59652 which had sessionid
> 0x35d53d539f0072
> 2012-03-15 17:48:32,406 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] -
> Accepted socket connection from /127.0.0.1:40908
> 2012-03-15 17:48:32,409 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client
> attempting to establish new session at /127.0.0.1:40908
> 2012-03-15 17:48:32,420 - INFO [CommitProcessor:0:ZooKeeperServer@604] -
> Established session 0x35d53d539f0073 with negotiated timeout 40000 for
> client /127.0.0.1:40908
> 2012-03-15 18:40:19,836 - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end
> of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x35d53d539f0073, likely client has closed socket
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-03-15 18:40:19,837 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed
> socket connection for client /127.0.0.1:40908 which had sessionid
> 0x35d53d539f0073
> 2012-03-15 19:40:30,503 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] -
> Accepted socket connection from /127.0.0.1:45743
> 2012-03-15 19:40:30,506 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client
> attempting to establish new session at /127.0.0.1:45743
> 2012-03-15 19:40:30,518 - INFO [CommitProcessor:0:ZooKeeperServer@604] -
> Established session 0x35d53d539f0074 with negotiated timeout 40000 for
> client /127.0.0.1:45743
>
> Any ideas how to debug this?
>
> [--- yarn-site.xml ---]
>
> <?xml version="1.0"?>
> <configuration>
>
>   <property>
>     <name>yarn.nodemanager.local-dirs</name>
>     <value>/tmp/nm-local-dir</value>
>     <description>List of directories to store localized files
>     in.</description>
>   </property>
>
>   <property>
>     <name>yarn.nodemanager.log-dirs</name>
>     <value>/tmp/logs</value>
>     <description>Where to store container logs.</description>
>   </property>
>
>   <property>
>     <name>yarn.nodemanager.resource.memory-mb</name>
>     <value>1000</value>
>     <description>Amount of physical memory, in MB, that can be allocated
>     for containers.</description>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.address</name>
>     <value>torb1pna001:8040</value>
>     <description>The address of the applications manager interface in the
>     RM.</description>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address</name>
>     <value>torb1pna001:8030</value>
>     <description>The address of the scheduler interface.</description>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.webapp.address</name>
>     <value>torb1pna001:8088</value>
>     <description>The address of the RM web application.</description>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address</name>
>     <value>torb1pna001:8025</value>
>     <description></description>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.admin.address</name>
>     <value>torb1pna001:8141</value>
>     <description>The address of the RM admin interface.</description>
>   </property>
>
>   <property>
>     <name>yarn.nodemanager.aux-services</name>
>     <value>mapreduce.shuffle</value>
>   </property>
>
>   <property>
>     <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
>     <value>org.apache.hadoop.mapred.ShuffleHandler</value>
>   </property>
>
> </configuration>
>
> [--- mapred-site.xml ---]
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
>   <property>
>     <name>mapreduce.framework.name</name>
>     <value>yarn</value>
>   </property>
>
>   <property>
>     <name>mapreduce.jobhistory.address</name>
>     <value>torb1pna001:10020</value>
>   </property>
>
>   <property>
>     <name>mapreduce.jobhistory.webapp.address</name>
>     <value>torb1pna001:19888</value>
>   </property>
>
>   <property>
>     <name>mapreduce.jobhistory.intermediate-done-dir</name>
>     <value>/tmp/mr-history/tmp</value>
>   </property>
>
>   <property>
>     <name>mapreduce.jobhistory.done-dir</name>
>     <value>/tmp/mr-history/done</value>
>   </property>
>
> </configuration>
>
> [--- output from HBase export ---]
>
> bash-3.2$ hadoop jar hbase-0.92.0-cdh4b1.jar export dev /backups/dev
> 12/03/15 17:48:30 INFO mapreduce.Export: verisons=1, starttime=0,
> endtime=9223372036854775807
> 12/03/15 17:48:30 WARN conf.Configuration: mapred.used.genericoptionsparser
> is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
> 12/03/15 17:48:30 WARN conf.Configuration: io.bytes.per.checksum is
> deprecated.
> Instead, use dfs.bytes-per-checksum
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:zookeeper.version=3.4.2-1221870, built on 12/21/2011 20:46 GMT
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:host.name=torb1pna001.dataraker.net
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:java.version=1.6.0_27
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:java.vendor=Sun Microsystems Inc.
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:java.home=/usr/java/jdk1.6.0_27/jre
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:java.class.path=/etc/hadoop/conf:/usr/<...snip...>
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:java.library.path=/usr/lib/hadoop/lib/native
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:java.io.tmpdir=/tmp
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:java.compiler=<NA>
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:os.version=2.6.18-274.el5
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:user.name=hdfs
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:user.home=/home/hadoop
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Client
> environment:user.dir=/usr/lib/hbase
> 12/03/15 17:48:32 INFO zookeeper.ZooKeeper: Initiating client connection,
> connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
> 12/03/15 17:48:32 INFO zookeeper.ClientCnxn: Opening socket connection to
> server /127.0.0.1:2181
> 12/03/15 17:48:32 INFO zookeeper.ClientCnxn: Socket connection established
> to localhost.localdomain/127.0.0.1:2181, initiating session
> 12/03/15 17:48:32 INFO zookeeper.RecoverableZooKeeper: The identifier of
> this process is [email protected]
> 12/03/15 17:48:32
> INFO zookeeper.ClientCnxn: Session establishment complete
> on server localhost.localdomain/127.0.0.1:2181, sessionid =
> 0x35d53d539f0073, negotiated timeout = 40000
> 12/03/15 17:48:32 WARN conf.Configuration: hadoop.native.lib is deprecated.
> Instead, use io.native.lib.available
> 12/03/15 17:48:32 WARN conf.Configuration: hadoop.native.lib is deprecated.
> Instead, use io.native.lib.available
> 12/03/15 17:48:32 WARN conf.Configuration: hadoop.native.lib is deprecated.
> Instead, use io.native.lib.available
> 12/03/15 17:48:32 WARN conf.Configuration: hadoop.native.lib is deprecated.
> Instead, use io.native.lib.available
> 12/03/15 17:48:32 INFO mapreduce.JobSubmitter: number of splits:73
> 12/03/15 17:48:32 WARN conf.Configuration: io.bytes.per.checksum is
> deprecated. Instead, use dfs.bytes-per-checksum
> 12/03/15 17:48:33 INFO mapred.ResourceMgrDelegate: Submitted application
> application_1331830579833_0004 to ResourceManager at
> torb1pna001/10.192.16.64:8040
> 12/03/15 17:48:33 INFO mapreduce.Job: The url to track the job:
> http://torb1pna001:8088/proxy/application_1331830579833_0004/
> 12/03/15 17:48:33 INFO mapreduce.Job: Running job: job_1331830579833_0004
> bash-3.2$ hadoop jar hbase-0.92.0-cdh4b1.jar export bch_dev /backups/bch_dev
> WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
> use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
> files.
> 12/03/15 19:40:28 INFO mapreduce.Export: verisons=1, starttime=0,
> endtime=9223372036854775807
> 12/03/15 19:40:28 WARN conf.Configuration: mapred.used.genericoptionsparser
> is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
> 12/03/15 19:40:28 WARN conf.Configuration: io.bytes.per.checksum is
> deprecated.
> Instead, use dfs.bytes-per-checksum
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:zookeeper.version=3.4.2-1221870, built on 12/21/2011 20:46 GMT
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:host.name=torb1pna001.dataraker.net
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:java.version=1.6.0_27
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:java.vendor=Sun Microsystems Inc.
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:java.home=/usr/java/jdk1.6.0_27/jre
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:java.class.path=/etc/hadoop/conf:<...snip...>
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:java.library.path=/usr/lib/hadoop/lib/native
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:java.io.tmpdir=/tmp
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:java.compiler=<NA>
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:os.version=2.6.18-274.el5
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:user.name=hdfs
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:user.home=/home/hadoop
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Client
> environment:user.dir=/usr/lib/hbase
> 12/03/15 19:40:30 INFO zookeeper.ZooKeeper: Initiating client connection,
> connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
> 12/03/15 19:40:30 INFO zookeeper.ClientCnxn: Opening socket connection to
> server /127.0.0.1:2181
> 12/03/15 19:40:30 INFO zookeeper.ClientCnxn: Socket connection established
> to localhost.localdomain/127.0.0.1:2181, initiating session
> 12/03/15 19:40:30 INFO zookeeper.RecoverableZooKeeper: The identifier of
> this process is [email protected]
> 12/03/15 19:40:30 INFO
> zookeeper.ClientCnxn: Session establishment complete
> on server localhost.localdomain/127.0.0.1:2181, sessionid =
> 0x35d53d539f0074, negotiated timeout = 40000
> 12/03/15 19:40:30 WARN conf.Configuration: hadoop.native.lib is deprecated.
> Instead, use io.native.lib.available
> 12/03/15 19:40:30 WARN conf.Configuration: hadoop.native.lib is deprecated.
> Instead, use io.native.lib.available
> 12/03/15 19:40:30 WARN conf.Configuration: hadoop.native.lib is deprecated.
> Instead, use io.native.lib.available
> 12/03/15 19:40:30 WARN conf.Configuration: hadoop.native.lib is deprecated.
> Instead, use io.native.lib.available
> 12/03/15 19:40:30 INFO mapreduce.JobSubmitter: number of splits:73
> 12/03/15 19:40:30 WARN conf.Configuration: io.bytes.per.checksum is
> deprecated. Instead, use dfs.bytes-per-checksum
> 12/03/15 19:40:31 INFO mapred.ResourceMgrDelegate: Submitted application
> application_1331840162147_0001 to ResourceManager at
> torb1pna001/10.192.16.64:8040
> 12/03/15 19:40:31 INFO mapreduce.Job: The url to track the job:
> http://torb1pna001:8088/proxy/application_1331840162147_0001/
> 12/03/15 19:40:31 INFO mapreduce.Job: Running job: job_1331840162147_0001

--
Harsh J
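For reference, the quick fix described in the reply amounts to one change in the quoted yarn-site.xml. This is only a sketch of that change (the 8192 value is the stock default Harsh mentions; anything at or above the 1024 MB the containers request would also unblock scheduling):

```xml
<!-- yarn-site.xml: raise the NodeManager's advertised capacity so it can
     host at least one default-sized (1024 MB) container. The NodeManager
     must be restarted to re-register with the new capability. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
  <description>Amount of physical memory, in MB, that can be allocated
  for containers.</description>
</property>
```

The converse approach (an assumption, not discussed in this thread) would be to leave the NM at 1000 MB and shrink the requests instead, via mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, and yarn.app.mapreduce.am.resource.mb in mapred-site.xml, so each request fits under the NM's capacity.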
