Hello Anirudh,

On 02-Jan-2012, at 5:31 AM, Anirudh wrote:
> Any specific reason why setup is called for every task attempt. From an
> optimization point of view, wouldn't it be good if the setup is called only
> once in case of JVM reuse.

Note that the task setup/cleanup procedures are separate from the API hooks 'setup'/'cleanup'. The latter are guaranteed to the developer to be called per split/partition, for pre-processing and post-processing.

> I have not yet looked at the implementation, in case of JVM reuse is the
> application Mapper instance reused or a new instance is created for every
> task attempt?

Reusing the Mapper instance wouldn't be good isolation-wise. Resetting a lot of other relevant parameters would be much more costly than reinitializing the whole task (but not the JVM). I think https://issues.apache.org/jira/browse/HADOOP-249, which introduced this, should help you.

>
> My suggestion for Eyal would be to have a static field initializer expression
> in the Mapper to create the helper class instance. This will ensure that the
> helper class will be instantiated when the Mapper class is loaded.

Yep, this is certainly possible, and it is a good advantage of using JVM reuse.

> On Sun, Jan 1, 2012 at 7:05 AM, Harsh J <[email protected]> wrote:
> You are guaranteed one setup call for every single task attempt. This
> is regardless of JVM reuse being on or off. JVM reuse will cause no
> issues with what Eyal is attempting to do.
>
> On Sun, Jan 1, 2012 at 5:49 PM, Anirudh <[email protected]> wrote:
> > No problems Eyal.
> >
> > On a second thought, for the JVM re-use the Mapper/Reducer instances should
> > be re-used, and the setup should be called only once. This makes sense too
> > as the JVM reuse is for the same job.
> > You should be good with class instantiation even if the JVM reuse is
> > enabled.
> >
> >
> > On Sat, Dec 31, 2011 at 11:39 PM, Eyal Golan <[email protected]> wrote:
> >>
> >> Thank you very much for the detailed explanation Anirudh.
> >>
> >> I think that my question about node / VM was due to some lack of knowledge
> >> (I'm just starting to learn the Hadoop environment).
> >> Regarding configuration of the nodes and clusters.
> >> This is something that I am not doing by myself. We have a dedicated team
> >> for managing the Hadoop cluster and I'll ask them.
> >>
> >> I think that my question should have been: How many instances of the
> >> 'helper' class will be created in a single VM?
> >> And, as I understand, considering I am creating the helper in the setup /
> >> configure method, there would be one.
> >> And as long as it's stateless, I'm good.
> >>
> >> Thanks again,
> >>
> >> Eyal
> >>
> >> Eyal Golan
> >> [email protected]
> >>
> >> Visit: http://jvdrums.sourceforge.net/
> >> LinkedIn: http://www.linkedin.com/in/egolan74
> >> Skype: egolan74
> >>
> >> P Save a tree. Please don't print this e-mail unless it's really
> >> necessary
> >>
> >> On Sat, Dec 31, 2011 at 1:36 PM, Anirudh <[email protected]> wrote:
> >>>
> >>> I just wanted to confirm where exactly you were planning to have the
> >>> instantiation code, as it was not mentioned in your previous post. The
> >>> location would have made a difference. As you are doing it in the setup of
> >>> mapper/reducer, you are good.
> >>>
> >>> I was referring to the Task JVM Reuse option:
> >>>
> >>> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse
> >>>
> >>> It states that if the option to reuse JVM is enabled, the same Task JVM
> >>> will execute multiple tasks (i.e. map/reduce). I am not sure how this is
> >>> implemented, whether a new Mapper/Reducer is created for each task or they
> >>> too are reused.
> >>> If a new instance is created each time, then the mapper/reducer and all
> >>> its references will be marked for garbage collection and you would be good.
> >>> If the Mapper/Reducer instances are re-used then the setup should be
> >>> called again, creating another instance of your helper class.
> >>>
> >>> In my opinion the latter does not make sense, and the implementation
> >>> would be according to the former approach, i.e. creation of a new
> >>> Mapper/Reducer for each Task. But it would be interesting to check.
> >>>
> >>> As the classes in question are helper classes (stateless) you may not get
> >>> affected in terms of functionality.
> >>>
> >>> I am not clear on one of your statements:
> >>>
> >>> How many map tasks will be created? One per split or one per VM (node)?
> >>> Are you suggesting that although there would be one Mapper in the node...
> >>>
> >>> Have you configured your node to have a single slot for map/reduce task?
> >>> If yes then there will be one Mapper/Reducer task in the node. If no there
> >>> could be more than one mapper/reducer in the node depending on lots of
> >>> other parameters, i.e. no. of mapper/reducer slots allocated on the node,
> >>> no. of input splits etc. If the node is configured to run more than one
> >>> Mapper/Reducer task the scheduler may choose to run more than one task on
> >>> the same node. The default is 2 Map & 2 Reduce tasks per node. And for
> >>> each task a new JVM is launched unless the JVM reuse option is enabled.
> >>>
> >>> Thanks,
> >>> Anirudh
> >>>
> >>>
> >>> On Sat, Dec 31, 2011 at 1:28 AM, Eyal Golan <[email protected]> wrote:
> >>>>
> >>>> My idea is to create that class in the setup / configure method (depends
> >>>> which Mapper / Reducer I will inherit from).
> >>>>
> >>>> I don't understand the 'reuse' option you are referring to.
> >>>> How many map tasks will be created? One per split or one per VM (node)?
> >>>> Are you suggesting that although there would be one Mapper in the node,
> >>>> each new operator (or reflecting) will create a new instance?
> >>>> Thus making lots of that instance?
> >>>>
> >>>> BTW,
> >>>> these helper classes I want to create are of course not going to be
> >>>> stateful. They are definitely 'helper' classes that have some logic.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Eyal
> >>>>
> >>>>
> >>>> On Sat, Dec 31, 2011 at 6:50 AM, Anirudh <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>> Where are you creating this new class? If it is in the map function,
> >>>>> then it will create a new object for each record in the split.
> >>>>>
> >>>>> Also you may need to see how the JVM reuse option works. I am not too
> >>>>> sure of this and you may want to look at the code. If the option for JVM
> >>>>> reuse is set, then my understanding is for every task, a new Map task
> >>>>> would be created and in that case the "new" operator will create another
> >>>>> instance even if this statement is not in the map function.
> >>>>>
> >>>>>
> >>>>> On Fri, Dec 30, 2011 at 6:22 AM, Eyal Golan <[email protected]> wrote:
> >>>>>>
> >>>>>> Great News !!
> >>>>>> Thanks for the info.
> >>>>>>
> >>>>>> So using reflection, I can inject different implementations of
> >>>>>> interfaces (services) for the mapper (or reducer).
> >>>>>> And this way I can test a mapper (or reducer).
> >>>>>> Just by reflecting a stub instead of a real implementation.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Eyal
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Dec 30, 2011 at 2:50 PM, Harsh J <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Eyal,
> >>>>>>>
> >>>>>>> Yes, it is right to think of each Task attempt being one individual
> >>>>>>> JVM running individually on any added Node. Multiple slots would mean
> >>>>>>> multiple VMs in parallel as well. Yes, your use of reflection to
> >>>>>>> build your objects will work just fine -- it's all user-side Java code
> >>>>>>> that is executed.
> >>>>>>>
> >>>>>>> On 30-Dec-2011, at 4:42 PM, Eyal Golan wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I want to understand a basic concept in MR.
> >>>>>>>
> >>>>>>> If a mapper creates an instance of some class (using the 'new'
> >>>>>>> operator), then the created class exists ONCE in the VM of this node.
> >>>>>>> For each node.
> >>>>>>> Correct?
> >>>>>>>
> >>>>>>> Now,
> >>>>>>> what if instead of using the 'new' operator, the class is created
> >>>>>>> using reflection.
> >>>>>>> Is it valid in a MR?
> >>>>>>> Will only one instance of the created class be existing in that node?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>>
> >>>>>>> Eyal
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
> --
> Harsh J
>
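
For anyone reading this thread later: the two placements discussed above (a helper created in setup, versus a helper created by a static field initializer) can be sketched in plain Java. This is only a rough sketch under stated assumptions. Helper, SketchMapper and JvmReuseDemo are hypothetical names, none of them are Hadoop classes, and the loop merely imitates several task attempts running one after another in a single reused JVM, assuming a new mapper instance per attempt.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stateless helper. The counter exists only to make
// instantiation visible in this sketch.
final class Helper {
    static final AtomicInteger CREATED = new AtomicInteger();

    Helper() {
        CREATED.incrementAndGet();
    }

    String normalize(String record) {
        return record.trim().toLowerCase();
    }
}

// Stand-in for a user Mapper. This is NOT org.apache.hadoop.mapreduce.Mapper.
class SketchMapper {
    // Static field initializer: runs once, when this class is initialized
    // in the JVM, no matter how many mapper instances follow.
    static final Helper SHARED = new Helper();

    // Assigned in setup(): one instance per (simulated) task attempt.
    private Helper perAttempt;

    void setup() {
        perAttempt = new Helper();
    }

    String map(String record) {
        return perAttempt.normalize(record);
    }
}

final class JvmReuseDemo {
    // Imitates `attempts` task attempts in one JVM and returns how many
    // Helper instances their setup() calls created.
    static int helpersCreatedBy(int attempts) {
        Helper shared = SketchMapper.SHARED; // force class initialization before counting
        int before = Helper.CREATED.get();
        for (int i = 0; i < attempts; i++) {
            SketchMapper mapper = new SketchMapper(); // assumption: new instance per attempt
            mapper.setup();
            mapper.map("  Some Record  ");
        }
        return Helper.CREATED.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("helpers created in setup(): " + helpersCreatedBy(3));
        System.out.println("total helpers in this JVM: " + Helper.CREATED.get());
    }
}
```

In this sketch the number of helpers created in setup() grows with the number of simulated attempts, while SHARED stays a single instance for the lifetime of the JVM. For a stateless helper either placement behaves the same functionally; the static variant just avoids re-creating it when the JVM is reused.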
