Wow, I didn't expect that. That's nastier than usual. I would think that cloning by serializing/deserializing would be unnecessarily slow. I would file a JIRA with Avro asking for a clone() or copy constructor in generated code.
-Joey

On Thu, Aug 4, 2011 at 5:07 PM, Vyacheslav Zholudev <[email protected]> wrote:
> Just sharing my today's discovery:
> Hadoop also reuses objects in internal lists, in my example the BAR objects.
> That is if the first FOO object has two BAR objects in the list, then the
> second FOO object will contain the same (equal by reference) first two BAR
> objects in the list. So in case of Avro it would be good if auto-generated
> code implemented a 'clone' method.
> Btw, is it good to clone avro-specific objects by serializing/deserializing
> using SpecificDatum{Writer|Reader}?
> Vyacheslav
>
> On 4 August 2011 21:35, <[email protected]> wrote:
>>
>> HADOOP-2399 has caused a lot of problems for users so far, and the saga
>> still continues :-(
>>
>> I remember spending 18 straight hours in 2008 with a user debugging this
>> issue.
>>
>> - milind
>>
>> ---
>> Milind Bhandarkar
>> Greenplum Labs, EMC
>> (Disclaimer: Opinions expressed in this email are those of the author, and
>> do not necessarily represent the views of any organization, past or present,
>> the author might be affiliated with.)
>>
>> On 8/3/11 4:19 AM, "Joey Echeverria" <[email protected]> wrote:
>>
>> >Hadoop reuses objects as an optimization. If you need to keep a copy
>> >in memory, you need to call clone yourself. I've never used Avro, but
>> >my guess is that the BARs are not reused, only the FOO.
>> >
>> >-Joey
>> >
>> >On Wed, Aug 3, 2011 at 3:18 AM, Vyacheslav Zholudev
>> ><[email protected]> wrote:
>> >> Hi all,
>> >>
>> >> I'm using Avro as a serialization format and assume I have a generated
>> >> specific class FOO that I use as a Mapper output format:
>> >>
>> >> class FOO {
>> >>   int a;
>> >>   List<BAR> barList;
>> >> }
>> >>
>> >> where BAR is another generated specific Java class.
>> >>
>> >> When I iterate over "Iterable<FOO> values" in the Reducer it is clear
>> >> that the same object of class FOO is reused, i.e.
>> >>
>> >> FOO foo1 = values.iterator().next();
>> >> FOO foo2 = values.iterator().next();
>> >> assertThat(foo1 == foo2, is(true));
>> >>
>> >> So I have the following questions:
>> >> 1) Is the list barList reused over the next() calls?
>> >> 2) If yes, can the objects that are in the barList be reused? For
>> >> example, if the first time next() is called, the list contains two BAR
>> >> objects, the next time next() is called the barList contains 3 objects
>> >> and 2 of them are equal by reference to the two from the list of the
>> >> first next() call. In other words, does Hadoop maintain some sort of
>> >> "object pool"?
>> >> 3) Why do not AvroTools generate clone() methods since it would be
>> >> quite straightforward and more importantly useful given that objects are
>> >> reused?
>> >>
>> >> Thanks a lot in advance!
>> >>
>> >> Vyacheslav
>> >
>> >--
>> >Joseph Echeverria
>> >Cloudera, Inc.
>> >443.305.9434
>
> --
> Best,
> Vyacheslav Zholudev

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
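
For anyone reading this thread later, the reuse described above can be reproduced without Hadoop or Avro. The sketch below is only an imitation under stated assumptions: `Foo` and `Bar` are hand-written stand-ins for the generated FOO and BAR classes, `reusingIterator` is a hypothetical helper that mimics the observed behavior (one shared FOO, BAR instances pooled by list position), and the copy constructors are written by hand, roughly what the thread suggests asking Avro to generate. It is not Hadoop's or Avro's actual code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class ReuseDemo {

    // Hypothetical stand-in for the generated BAR class.
    static final class Bar {
        int id;
        Bar(int id) { this.id = id; }
        Bar(Bar other) { this.id = other.id; }   // hand-written copy constructor
    }

    // Hypothetical stand-in for the generated FOO class.
    static final class Foo {
        int a;
        final List<Bar> barList = new ArrayList<>();
        Foo() { }
        // Deep copy: the list and every Bar in it are duplicated.
        Foo(Foo other) {
            this.a = other.a;
            for (Bar b : other.barList) {
                barList.add(new Bar(b));
            }
        }
    }

    // Imitates the reuse reported in the thread: every next() returns the
    // same Foo, and Bar instances are recycled by their position in the list.
    static Iterator<Foo> reusingIterator(int[][] records) {
        final Foo shared = new Foo();
        final List<Bar> pool = new ArrayList<>();
        return new Iterator<Foo>() {
            private int pos = 0;

            @Override
            public boolean hasNext() { return pos < records.length; }

            @Override
            public Foo next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int[] ids = records[pos];
                shared.a = pos;
                pos++;
                shared.barList.clear();
                for (int i = 0; i < ids.length; i++) {
                    if (i == pool.size()) {
                        pool.add(new Bar(ids[i]));   // grow the pool on demand
                    }
                    Bar recycled = pool.get(i);
                    recycled.id = ids[i];            // overwrite in place
                    shared.barList.add(recycled);
                }
                return shared;
            }
        };
    }

    // Keeps the references handed out by the iterator (the pitfall).
    static List<Foo> collectByReference(Iterator<Foo> it) {
        List<Foo> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    // Deep-copies each value before keeping it (the workaround).
    static List<Foo> collectCopies(Iterator<Foo> it) {
        List<Foo> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(new Foo(it.next()));
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] records = { {1, 2}, {7, 8, 9} };
        List<Foo> aliased = collectByReference(reusingIterator(records));
        List<Foo> copied = collectCopies(reusingIterator(records));
        System.out.println(aliased.get(0) == aliased.get(1));   // true: same object
        System.out.println(aliased.get(0).barList.get(0).id);   // 7: first record was overwritten
        System.out.println(copied.get(0).barList.get(0).id);    // 1: the copy kept its own data
    }
}
```

A shallow copy of `Foo` would not be enough here, because the pooled `Bar` instances would still be shared between records. That is why the copy constructor duplicates each list element as well.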
