Re: instantiation of classes in MR

Harsh J Mon, 02 Jan 2012 03:20:03 -0800

Hello Anirudh,

On 02-Jan-2012, at 5:31 AM, Anirudh wrote:


> Any specific reason why setup is called for every task attempt? From an 
> optimization point of view, wouldn't it be good if setup were called only 
> once in case of JVM reuse?

Note that the framework's task setup/cleanup procedures are separate from the 
API hooks 'setup'/'cleanup'. The latter are guaranteed to the developer to be 
called once per task attempt, that is, once per split (map) or partition 
(reduce), for pre-processing and post-processing.
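
To make that guarantee concrete, here is a minimal plain-Java sketch of the 
lifecycle. It does not use the Hadoop API: Helper, SketchMapper and 
runTaskAttempt are hypothetical names that only mirror the shape of the 
setup/map/cleanup hooks, on the assumption that each task attempt gets its own 
mapper instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stateless helper; stands in for Eyal's helper class.
class Helper {
    static int instancesCreated = 0;   // counts constructions in this JVM

    Helper() { instancesCreated++; }

    String normalize(String record) {
        return record.trim().toLowerCase(Locale.ROOT);
    }
}

// Hypothetical mapper that mirrors the setup/map/cleanup hooks (not Hadoop's Mapper).
class SketchMapper {
    private Helper helper;
    final List<String> output = new ArrayList<>();

    void setup()            { helper = new Helper(); }               // once per task attempt
    void map(String record) { output.add(helper.normalize(record)); } // once per record
    void cleanup()          { helper = null; }                       // once per task attempt

    // Simulates one task attempt over one split: new instance, setup, map*, cleanup.
    static List<String> runTaskAttempt(List<String> split) {
        SketchMapper mapper = new SketchMapper();
        mapper.setup();
        for (String record : split) {
            mapper.map(record);
        }
        mapper.cleanup();
        return mapper.output;
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        // Two attempts in the same JVM, as would happen with JVM reuse.
        SketchMapper.runTaskAttempt(Arrays.asList(" A ", "b"));
        SketchMapper.runTaskAttempt(Arrays.asList("C"));
        System.out.println("helpers created: " + Helper.instancesCreated);
    }
}
```

In this sketch the helper is built once per attempt no matter how many records 
the split holds, which is the behaviour Eyal is relying on.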

> I have not yet looked at the implementation, in case of JVM reuse is the 
> application Mapper instance reused or a new instance is created for every 
> task attempt?

Reusing the instance wouldn't be good isolation-wise, so a new one is created 
for every task attempt. Resetting a lot of other relevant parameters would be 
much more costly than reinitializing the whole task (but not the JVM). I think 
https://issues.apache.org/jira/browse/HADOOP-249, which introduced JVM reuse, 
should help you.

> 
> My suggestion for Eyal would be to have a static field initializer expression 
> in the Mapper to create the helper class instance. This will ensure that the 
> helper class will be instantiated when the Mapper class is loaded.

Yep, this is certainly possible, and it is a good advantage of using JVM reuse.
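
A minimal plain-Java sketch of that suggestion, again without the Hadoop API: 
SharedHelper, StaticHelperMapper and simulateTwoAttempts are hypothetical names, 
and the loop only imitates two task attempts running in one reused JVM.

```java
// Hypothetical stateless helper shared by every mapper instance in the JVM.
class SharedHelper {
    static int instancesCreated = 0;   // counts constructions in this JVM

    SharedHelper() { instancesCreated++; }

    int length(String record) { return record.length(); }
}

// Hypothetical mapper; not Hadoop's Mapper class.
class StaticHelperMapper {
    // Static field initializer: runs once, when this class is initialized in the JVM.
    private static final SharedHelper HELPER = new SharedHelper();

    int map(String record) { return HELPER.length(record); }
}

class StaticInitDemo {
    // Imitates two task attempts in one JVM: a fresh mapper each time, same helper.
    static int simulateTwoAttempts() {
        int total = 0;
        for (int attempt = 0; attempt < 2; attempt++) {
            StaticHelperMapper mapper = new StaticHelperMapper();
            total += mapper.map("abc");
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("total length: " + simulateTwoAttempts());
        System.out.println("helpers created: " + SharedHelper.instancesCreated);
    }
}
```

Because the helper lives in a static field, it must stay stateless (or at least 
be safe to share), since it outlives any single task attempt in that JVM.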

> On Sun, Jan 1, 2012 at 7:05 AM, Harsh J <[email protected]> wrote:
> You are guaranteed one setup call for every single task attempt. This
> is regardless of JVM reuse being on or off. JVM reuse will cause no
> issues with what Eyal is attempting to do.
> 
> On Sun, Jan 1, 2012 at 5:49 PM, Anirudh <[email protected]> wrote:
> > No problems Eyal.
> >
> > On second thought, for JVM re-use the Mapper/Reducer instances should
> > be re-used, and the setup should be called only once. This makes sense too
> > as the JVM reuse is for the same job.
> > You should be good with class instantiation even if the JVM reuse is
> > enabled.
> >
> >
> > On Sat, Dec 31, 2011 at 11:39 PM, Eyal Golan <[email protected]> wrote:
> >>
> >> Thank you very much for the detailed explanation Anirudh.
> >>
> >> I think that my question about node / VM was due to some lack of knowledge
> >> (I'm just starting to learn the Hadoop environment).
> >> Regarding configuration of the nodes and clusters.
> >> This is something that I am not doing by myself. We have a dedicated team
> >> for managing the Hadoop cluster and I'll ask them.
> >>
> >> I think that my question should have been: How many instances of the
> >> 'helper' class will be created in a single VM.
> >> And, as I understand, consider I am creating the helper in the setup /
> >> configure method, there would be one.
> >> And as long as it's stateless, I'm good.
> >>
> >> Thanks again,
> >>
> >> Eyal
> >>
> >>
> >>
> >> Eyal Golan
> >> [email protected]
> >>
> >> Visit: http://jvdrums.sourceforge.net/
> >> LinkedIn: http://www.linkedin.com/in/egolan74
> >> Skype: egolan74
> >>
> >> P  Save a tree. Please don't print this e-mail unless it's really
> >> necessary
> >>
> >>
> >>
> >> On Sat, Dec 31, 2011 at 1:36 PM, Anirudh <[email protected]> wrote:
> >>>
> >>> I just wanted to confirm where exactly you were planning to have the
> >>> instantiation code, as it was not mentioned in your previous post. The
> >>> location would have made difference. As you are doing it in the setup of
> >>> mapper/reducer, you are good.
> >>>
> >>> I was referring to the Task JVM Reuse option:
> >>>
> >>> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse
> >>>
> >>> It states that if the option to reuse JVM is enabled, the same Task JVM
> >>> will execute multiple tasks(i.e. map/reduce). I am not sure how this is
> >>> implemented, whether a new Mapper/Reducer is created for each task or they
> >>> too are re-reused.
> >>> If a new instance is created each time, then the mapper/reducer and all
> >>> its references will be marked for garbage collection and you would be good.
> >>> If the Mapper/Reducer instances are re-used then the setup should be
> >>> called again creating another instance of your helper class.
> >>>
> >>> In my opinion the latter does not make sense, and the implementation
> >>> would be according to the prior approach i.e. creation of a new
> >>> Mapper/Reducer for each Task. But it would be interesting to check.
> >>>
> >>> As the classes in question are helper classes(stateless) you may not get
> >>> affected in terms of functionality.
> >>>
> >>> I am not clear on one of your statements:
> >>>
> >>> How many map tasks will be created? One per split or one per VM (node)?
> >>> Are you suggesting that although there would be one Mapper in the node...
> >>>
> >>> Have you configured your node to have a single slot for map/reduce tasks?
> >>> If yes, then there will be one Mapper/Reducer task on the node. If not,
> >>> there could be more than one mapper/reducer on the node depending on lots
> >>> of other parameters, i.e. the number of mapper/reducer slots allocated on
> >>> the node, the number of input splits, etc. If the node is configured to
> >>> run more than one Mapper/Reducer task, the scheduler may choose to run more
> >>> than one task on the same node. The default is 2 Map & 2 Reduce tasks per
> >>> node. And for each task a new JVM is launched unless the JVM reuse option
> >>> is enabled.
> >>>
> >>> Thanks,
> >>> Anirudh
> >>>
> >>>
> >>> On Sat, Dec 31, 2011 at 1:28 AM, Eyal Golan <[email protected]> wrote:
> >>>>
> >>>> My idea is to create that class in the setup / configure method (depends
> >>>> which Mapper / Reducer I will inherit from).
> >>>>
> >>>> I don't understand the 'reuse' option you are referring to.
> >>>> How many map tasks will be created? One per split or one per VM (node)?
> >>>> Are you suggesting that although there would be one Mapper in the node,
> >>>> each new operator (or reflecting) will create a new instance?
> >>>> Thus making lots of that instance?
> >>>>
> >>>> BTW,
> >>>> these helper classes I want to create are of course not going to be
> >>>> stateful. They are definitely 'helper' classes that have some logic.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Eyal
> >>>>
> >>>> Eyal Golan
> >>>> [email protected]
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Dec 31, 2011 at 6:50 AM, Anirudh <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>> Where are you creating this new class? If it is in the map function,
> >>>>> then it will create a new object for each record in the split.
> >>>>>
> >>>>> Also you may need to see how the JVM reuse option works. I am not too
> >>>>> sure of this and you may want to look at the code. If the option for
> >>>>> JVM reuse is set, then my understanding is for every task, a new Map
> >>>>> task would be created and in that case the "new" operator will create
> >>>>> another instance even if this statement is not in the map function.
> >>>>>
> >>>>>
> >>>>> On Fri, Dec 30, 2011 at 6:22 AM, Eyal Golan <[email protected]> wrote:
> >>>>>>
> >>>>>> Great News !!
> >>>>>> Thanks for the info.
> >>>>>>
> >>>>>> So using reflection, I can inject different implementations of
> >>>>>> interfaces (services) for the mapper (or reducer).
> >>>>>> And this way I can test a mapper (or reducer).
> >>>>>> Just by reflecting a stub instead of a real implementation.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Eyal Golan
> >>>>>> [email protected]
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Dec 30, 2011 at 2:50 PM, Harsh J <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Eyal,
> >>>>>>>
> >>>>>>> Yes, it is right to think of each Task attempt being one individual
> >>>>>>> JVM running individually on any added Node. Multiple slots would mean
> >>>>>>> multiple VMs in parallel as well. Yes, your use of reflection to
> >>>>>>> build your objects will work just fine -- it's all user-side Java
> >>>>>>> code that is executed.
> >>>>>>>
> >>>>>>> On 30-Dec-2011, at 4:42 PM, Eyal Golan wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I want to understand a basic concept in MR.
> >>>>>>>
> >>>>>>> If a mapper creates an instance of some class (using the 'new'
> >>>>>>> operator), then the created class exists ONCE in the VM of this node.
> >>>>>>> For each node.
> >>>>>>> Correct?
> >>>>>>>
> >>>>>>> Now,
> >>>>>>> what if instead of using the 'new' operator, the class is created
> >>>>>>> using reflection.
> >>>>>>> Is it valid in a MR?
> >>>>>>> Will only one instance of the created class be existing in that node?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>>
> >>>>>>> Eyal
> >>>>>>>
> >>>>>>> Eyal Golan
> >>>>>>> [email protected]
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> 
> 
> 
> --
> Harsh J
>
