Hi Satish
After changing dfs.block.size to 40, did you recopy the files? Changing
dfs.block.size won't affect the existing files in HDFS; it only applies to
new files you copy in. In short, with dfs.block.size=40,
mapred.min.split.size=0 and mapred.max.split.size=40, do a copyFromLocal and
try executing your job on this newly copied data.
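If it helps, here is a rough, untested sketch of setting the same split values at
job level with the newer mapreduce API (it would sit inside your driver's main;
the driver class and paths here are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
conf.set("mapred.min.split.size", "0");   // split sizes are in bytes
conf.set("mapred.max.split.size", "40");  // at most 40 bytes per split
Job job = new Job(conf, "split-test");    // placeholder job name
job.setJarByClass(MyDriver.class);        // MyDriver is a placeholder driver class
job.setInputFormatClass(TextInputFormat.class);
// ... plus your mapper and output key/value settings ...
FileInputFormat.setInputPaths(job, new Path("/user/satish/input"));    // placeholder path
FileOutputFormat.setOutputPath(job, new Path("/user/satish/output"));  // placeholder path
job.waitForCompletion(true);

And again, dfs.block.size only takes effect for files written after the change,
which is why the fresh copyFromLocal matters.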
Regards
Bejoy K S
-----Original Message-----
From: "Satish Setty (HCL Financial Services)" <[email protected]>
Date: Tue, 10 Jan 2012 08:57:37
To: Bejoy Ks<[email protected]>
Cc: [email protected]<[email protected]>
Subject: RE: hadoop
Hi Bejoy,
Thanks for the help. I changed the values mapred.min.split.size=0 and
mapred.max.split.size=40, but the job counters do not reflect any change.
For posting, kindly let me know the correct link/mail-id. At present I am
sending directly to your account ["Bejoy Ks [[email protected]]"], which
has been a great help to me; posting to the group account
[email protected] bounces back.
Counter                                               Map     Reduce     Total
File Input Format Counters: Bytes Read                 61          0        61
Job Counters: SLOTS_MILLIS_MAPS                         0          0     3,886
Job Counters: Launched map tasks                        0          0         2
Job Counters: Data-local map tasks                      0          0         2
FileSystemCounters: HDFS_BYTES_READ                   267          0       267
FileSystemCounters: FILE_BYTES_WRITTEN             58,134          0    58,134
Map-Reduce Framework: Map output materialized bytes     0          0         0
Map-Reduce Framework: Combine output records            0          0         0
Map-Reduce Framework: Map input records                 9          0         9
Map-Reduce Framework: Spilled Records                   0          0         0
Map-Reduce Framework: Map output bytes                 70          0        70
Map-Reduce Framework: Map input bytes                  54          0        54
Map-Reduce Framework: SPLIT_RAW_BYTES                 206          0       206
Map-Reduce Framework: Map output records                7          0         7
Map-Reduce Framework: Combine input records             0          0         0
----------------
From: Bejoy Ks [[email protected]]
Sent: Monday, January 09, 2012 11:13 PM
To: Satish Setty (HCL Financial Services)
Cc: [email protected]
Subject: Re: hadoop
Hi Satish
It would be good if you don't cross post your queries. Just post it once
on the right list.
What is your value for mapred.max.split.size? Try setting these values
as well
mapred.min.split.size=0 (it is the default value)
mapred.max.split.size=40
Try executing your job once you apply these changes on top of the others you made.
Regards
Bejoy.K.S
On Mon, Jan 9, 2012 at 5:09 PM, Satish Setty (HCL Financial Services)
<[email protected] <mailto:[email protected]> > wrote:
Hi Bejoy,
Even with the settings below, map tasks never go beyond 2. Is there any way to make
this spawn 10 tasks? Basically it should behave like a compute grid - computation in
parallel.
<property>
<name>io.bytes.per.checksum</name>
<value>30</value>
<description>The number of bytes per checksum. Must not be larger than
io.file.buffer.size.</description>
</property>
<property>
<name>dfs.block.size</name>
<value>30</value>
<description>The default block size for new files.</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>10</value>
<description>The maximum number of map tasks that will be run
simultaneously by a task tracker.
</description>
</property>
----------------
From: Satish Setty (HCL Financial Services)
Sent: Monday, January 09, 2012 1:21 PM
To: Bejoy Ks
Cc: [email protected] <mailto:[email protected]>
Subject: RE: hadoop
Hi Bejoy,
In HDFS I have set the block size to 40 bytes. The input data set is as below
data1 (5*8=40 bytes)
data2
......
data10
But still I see only 2 map tasks spawned; there should have been at least 10 map
tasks. Not sure how it works internally. Line feed does not work [as you have
explained below].
Thanks
----------------
From: Satish Setty (HCL Financial Services)
Sent: Saturday, January 07, 2012 9:17 PM
To: Bejoy Ks
Cc: [email protected] <mailto:[email protected]>
Subject: RE: hadoop
Thanks Bejoy - great information - will try out.
For the problem below I meant a single node with a high configuration -> 8 CPUs and
8 GB memory. Hence I am taking an example of 10 data items separated by line feeds.
We want to utilize the full power of the machine - hence we want at least 10 map
tasks - and each task needs to perform a highly complex mathematical simulation. At
present it looks like split size (in bytes) over the file data is the only way to
control the number of map tasks - but I would prefer some criterion like line feeds
or similar.
In the example below, 'data1' corresponds to 5*8=40 bytes; if I have data1 ....
data10, in theory I should see 10 map tasks with a split size of 40 bytes.
How do I perform logging - where is the log (Apache logger) data written?
System.outs may not show up since the tasks run as background processes.
Regards
----------------
From: Bejoy Ks [[email protected] <mailto:[email protected]> ]
Sent: Saturday, January 07, 2012 7:35 PM
To: Satish Setty (HCL Financial Services)
Cc: [email protected] <mailto:[email protected]>
Subject: Re: hadoop
Hi Satish
Please find some pointers inline
Problem - As per the documentation, file splits correspond to the number of map tasks.
File split is governed by block size - 64 MB in hadoop-0.20.203.0. Where can I
find the default settings for various parameters like block size and number of
map/reduce tasks?
[Bejoy] I'd rather state it the other way round: the number of map tasks triggered
by an MR job is determined by the number of input splits (and the input format). If
you use TextInputFormat with default settings, the number of input splits is equal
to the number of HDFS blocks occupied by the input, and the size of an input split
defaults to the HDFS block size (64 MB). If you want more than one split per HDFS
block, you need to set mapred.max.split.size to a value smaller than the block size.
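For reference, the FileInputFormat in the newer mapreduce API computes the split
size roughly as below (paraphrasing the source, so treat it as an approximation):

// splitSize = max(mapred.min.split.size, min(mapred.max.split.size, block size))
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

So with a 40-byte block size, min split 0 and max split 40, each split works out
to at most 40 bytes.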
You can find pretty much all default configuration values from the downloaded
.tar at
hadoop-0.20.*/src/mapred/mapred-default.xml
hadoop-0.20.*/src/hdfs/hdfs-default.xml
hadoop-0.20.*/src/core/core-default.xml
If you want to alter some of these values then you can provide the same in
$HADOOP_HOME/conf/mapred-site.xml
$HADOOP_HOME/conf/hdfs-site.xml
$HADOOP_HOME/conf/core-site.xml
Values provided in *-site.xml override the values in *-default.xml for the
corresponding configuration parameters; a parameter marked final in a site file
cannot be overridden further (for example at job level).
I require at least 10 map tasks, which is the same as the number of "line feeds".
Each corresponds to a complex calculation to be done by a map task, so I can have
optimal CPU utilization - 8 CPUs.
[Bejoy] Hadoop is a good choice for processing large amounts of data. It is not
wise to use one mapper per record/line in a file, as creating a map task is itself
expensive, with JVM spawning and so on. Currently you may have 10 records in your
input, but I believe you are just testing Hadoop in a dev environment; in
production that wouldn't be the case - there could be n files having m records
each, and this m can be in the millions (just assuming, based on my experience).
On larger data sets you may not need to split on line boundaries. There can be
multiple lines in a file, and if you use TextInputFormat just one line is processed
by a map task at an instant; if you have n map tasks running, then n lines can be
processed at the same instant, one by each map task. On larger data volumes map
tasks are spawned on specific nodes primarily based on data locality, then on the
available task slots on the data-local nodes, and so on. It is possible that if you
have a 10-node cluster and 10 HDFS blocks for an input file, and all the blocks
happen to be present on only 8 nodes with sufficient task slots available on all 8,
then the tasks for your job may be executed on those 8 nodes alone instead of 10.
So there are chances that there won't be 100% balanced CPU utilization across the
nodes in a cluster.
I'm not really sure how you can spawn map tasks based on line feeds in a file.
Let us wait for others to comment on this.
Also, if you are using MapReduce for parallel computation alone, then make sure you
set the number of reducers to zero; with that you can save a lot of time that would
otherwise be spent on the sort and shuffle phases.
(-D mapred.reduce.tasks=0)
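A rough map-only sketch with the newer mapreduce API (class names are placeholders,
just to illustrate):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(new Configuration(), "parallel-compute"); // placeholder job name
job.setJarByClass(ComputeDriver.class);  // placeholder driver class
job.setMapperClass(ComputeMapper.class); // placeholder mapper holding the heavy computation
job.setNumReduceTasks(0);                // map-only: skips sort/shuffle entirely
job.waitForCompletion(true);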
The behaviour of map tasks looks strange to me, as sometimes if I set
jobconf.set(num map tasks) in the program it takes 2 or 8.
[Bejoy] There is no default value for the number of map tasks; it is determined by
the input splits and the input format used by your job. You cannot fix the number
of map tasks: even if you set mapred.map.tasks at job level, it is only a hint and
may not be honored. But you can definitely specify the number of reduce tasks at
job level with job.setNumReduceTasks(n) or mapred.reduce.tasks; if not set, it
takes the default value for reduce tasks specified in the conf files.
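In code terms, with the old JobConf API that your jobconf.set(...) suggests (a
sketch of the distinction, not a tested job):

conf.setNumMapTasks(10);    // only a hint; the input splits decide the real number
conf.setNumReduceTasks(2);  // honored; would produce part-00000 and part-00001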
I see some files like part-00001...
Are they partitions?
[Bejoy] The part-000* files correspond to reducers. You'd have n files if you have
n reducers, as each reducer produces one output file.
Hope it helps!..
Regards
Bejoy.KS
On Sat, Jan 7, 2012 at 3:32 PM, Satish Setty (HCL Financial Services)
<[email protected] <mailto:[email protected]> > wrote:
Hi Bijoy,
Just finished installation and tested sample applications.
Problem - As per the documentation, file splits correspond to the number of map tasks.
File split is governed by block size - 64 MB in hadoop-0.20.203.0. Where can I
find the default settings for various parameters like block size and number of
map/reduce tasks?
Is it possible to control the file split by "line feed - \n"? I tried giving sample
input -> jobconf -> TextInputFormat
date1
date2
date3
.......
......
date10
But when I run it I see the number of map tasks = 2 or 1.
I require at least 10 map tasks, which is the same as the number of "line feeds".
Each corresponds to a complex calculation to be done by a map task, so I can have
optimal CPU utilization - 8 CPUs.
The behaviour of map tasks looks strange to me, as sometimes if I set
jobconf.set(num map tasks) in the program it takes 2 or 8. I see some files like
part-00001...
Are they partitions?
Thanks
----------------
From: Satish Setty (HCL Financial Services)
Sent: Friday, January 06, 2012 12:29 PM
To: [email protected] <mailto:[email protected]>
Subject: FW: hadoop
Thanks Bejoy. Extremely useful information. We will try it out and come back. Does
the web application [jobtracker web UI] require separate deployment, or does an
application server container come built in with Hadoop?
Regards
----------------
From: Bejoy Ks [[email protected] <mailto:[email protected]> ]
Sent: Friday, January 06, 2012 12:54 AM
To: [email protected] <mailto:[email protected]>
Subject: Re: hadoop
Hi Satish
Please find some pointers in line
(a) How do we know the number of map tasks spawned? Can this be controlled? We
notice only 4 JVMs running on a single node - namenode, datanode, jobtracker,
tasktracker. As we understand it, as many map tasks are spawned as there are
splits - so we should see a corresponding increase in JVMs.
[Bejoy] namenode, datanode, jobtracker, tasktracker and secondaryNameNode are the
default Hadoop processes; they do not depend on your tasks. Your custom tasks are
launched in separate JVMs. You can control the maximum number of mappers running on
each tasktracker at an instant by setting mapred.tasktracker.map.tasks.maximum. By
default each task (map or reduce) executes in its own JVM, and once the task is
completed the JVM is destroyed. You are right: by default one map task is launched
per input split.
Just check the jobtracker web UI
(http://nameNodeHostName:50030/jobtracker.jsp); it gives you all the details of the
job, including the number of map tasks spawned by it. If you want to run multiple
tasktracker and datanode instances on the same machine, you need to ensure that
there are no port conflicts.
(b) Our mapper class must perform complex computations and it has plenty of
dependent jars, so how do we add all the jars to the class path while running the
application? Since we need to perform parallel computations, we need many map tasks
running in parallel with different data, all on the same machine in different JVMs.
[Bejoy] If these dependent jars are used by almost all of your applications,
include them in the class path of all your nodes (in your case just one node).
Alternatively, you can use the -libjars option while submitting your job. For more
details refer to
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
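One caveat: -libjars (and the other generic options) is only picked up when your
driver goes through GenericOptionsParser, usually via ToolRunner. A minimal sketch,
with placeholder class names:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already contains whatever -libjars / -D options were passed
    Job job = new Job(getConf(), "my-job");
    job.setJarByClass(MyDriver.class);
    // ... set mapper, input/output paths from args ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}

Invoked roughly as: hadoop jar myjob.jar MyDriver -libjars dep1.jar,dep2.jar <input> <output>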
(c) How does the data split happen? JobClient does not talk about data splits. As
we understand it, we format the distributed file system, run start-all.sh and then
"hadoop fs -put". Does this write data to all datanodes? We are unable to see the
physical location. How does the split happen from this HDFS source?
[Bejoy] Input files are split into blocks during the copy into HDFS itself; the
size of each block is determined by the Hadoop configuration of your cluster. The
name node decides which datanodes each block (and its replicas) should be placed
on, and these details are passed on to the client. The client copies each block to
one datanode, and from that datanode the block is replicated to the other
datanodes. The splitting of a file happens at the HDFS API level.
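If you want to see where the blocks of a file physically ended up, one option is
the FileSystem API - a small sketch, with a placeholder path:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/user/satish/input/data.txt")); // placeholder path
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // prints the offset, length and the datanode hosts holding each block
      System.out.println(block.getOffset() + " " + block.getLength() + " "
          + Arrays.toString(block.getHosts()));
    }
  }
}

(hadoop fsck <path> -files -blocks -locations gives similar information from the
command line.)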
thanks
----------------