[jira] [Comment Edited] (HADOOP-13010) Refactor raw erasure coders

Colin Patrick McCabe (JIRA) Thu, 14 Apr 2016 11:29:41 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241679#comment-15241679
 ]


Colin Patrick McCabe edited comment on HADOOP-13010 at 4/14/16 6:29 PM:
------------------------------------------------------------------------

Thanks for this, [~drankye].  Looks good overall!

I like the idea of moving some of the utility stuff into {{CoderUtil.java}}.

{code}
  static byte[] getEmptyChunk(int leastLength) {
    if (emptyChunk.length >= leastLength) {
      return emptyChunk; // In most time
    }
    synchronized (AbstractRawErasureCoder.class) {
      emptyChunk = new byte[leastLength];
    }
    return emptyChunk;
  }
{code}
This isn't safe for multiple threads, since we could be reading 
{{CoderUtil#emptyChunk}} while it's in the middle of being written.  You must 
either make this {{volatile}} or hold the lock for this entire function.

It's unfortunate that we need a function like this-- I was hoping that there 
would be some more efficient way of zeroing a ByteBuffer.  One thing that's a 
little concerning here is that a caller could modify the array returned by 
{{getEmptyChunk}}, which would cause problems for other callers.  To avoid 
this, it's probably better to make this {{private}} to {{CoderUtil.java}}.

{code}
  static ByteBuffer convertInputBuffer(byte[] input, int offset, int len) {
{code}
Hmm.  This name seems a bit confusing.  What this function does has nothing to 
do with whether the buffer is for "input" versus "output"-- it's just copying 
data from an array to a {{DirectByteBuffer}}.  It's also not so much a 
"conversion" as a "copy".  Maybe something like {{cloneAsDirectByteBuffer}} 
would be a better name?

{code}
  static <T> int[] getErasedOrNotToReadIndexes(T[] inputs) {
{code}
Should be named {{getNullIndexes}}?

{code}
  static <T> T findFirstValidInput(T[] inputs) {
    if (inputs.length > 0 && inputs[0] != null) {
      return inputs[0];
    }

    for (T input : inputs) {
      if (input != null) {
        return input;
      }
    }
    ...
{code}
Why do we need the special case for the first element here?

{code}
  static <T> void makeValidIndexes(T[] inputs, int[] validIndexes) {
{code}
Should be named {{getNonNullIndexes}}?  Also, why does this one take an array 
passed in, whereas {{getNullIndexes}} returns an array?  I also don't see how 
the caller is supposed to know how many of the array slots were used by the 
function.  If the array starts as all zeros, that is identical to the function 
putting a zero in the first element of the array and then returning, right?  
Perhaps we could mandate that the caller set all the array slots to a negative 
value before calling the function, but that seems like an awkward calling 
convention-- and certainly one that should be documented via JavaDoc.

{code}
  @Override
  protected void doDecode(DecoderState decoderState, ByteBuffer[] inputs,
                          int[] erasedIndexes, ByteBuffer[] outputs) {
{code}
I'm not sure why we wouldn't just store {{DecoderState}} in the {{Decoder}}?  
These are stateful objects, I assume.

Continuing my comments from earlier:
* {{AbstractRawErasureCoder}} -- why do we need this base class?  Its function 
seems to be just storing configuration values.  Perhaps we'd be better off just 
having an {{ErasureEncodingConfiguration}} class which other objects can own 
(not inherit from).  I think of a configuration as something you *own*, not 
something you *are*, which is why I think composition would make more sense 
here.  Also, is it possible for this to be immutable?  Mutable configuration is 
a huge headache (another reason I dislike {{Configured.java}})
* {{AbstractRawErasureEncoder}} /{{AbstractRawErasureDecoder}} -- why are these 
classes separate from {{RawErasureEncoder}} / {{RawErasureDecoder}}?  Do we 
expect that any encoders will implement {{RawErasureEncoder}}, but not extend 
{{AbstractRawErasureEncoder}}?  If not, it would be better just to have two 
base classes here rather than 2 classes and 2 interfaces.  Base classes are 
also easier to extend in the future than interfaces because you can add new 
methods without breaking backwards compatibility (as long as you have a default 
in the base).
* {{DummyRawDecoder}} -- {{NoOpRawDecoder}} would be a better name than 
"Dummy".  Is this intended to be used just in unit tests, or is it something 
the end-user should be able to configure?  If it is just unit tests, it should 
be under a {{test}} path, rather than a {{main}} path... i.e. 
{{hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/erasurecode/rawcoder/DummyRawDecoder.java}}


was (Author: cmccabe):
Thanks for this, [~drankye].  Looks good overall!

I like the idea of moving some of the utility stuff into {{CoderUtil.java}}.

{code}
  static byte[] getEmptyChunk(int leastLength) {
    if (emptyChunk.length >= leastLength) {
      return emptyChunk; // In most time
    }
    synchronized (AbstractRawErasureCoder.class) {
      emptyChunk = new byte[leastLength];
    }
    return emptyChunk;
  }
{code}
This isn't safe for multiple threads, since we could be reading 
{{CoderUtil#emptyChunk}} while it's in the middle of being written.  You must 
either make this {{volatile}} or hold the lock for this entire function.

It's unfortunate that we need a function like this-- I was hoping that there 
would be some more efficient way of zeroing a ByteBuffer.  One thing that's a 
little concerning here is that a caller could modify the array returned by 
{{getEmptyChunk}}, which would cause problems for other callers.  To avoid 
this, it's probably better to make this {{private}} to {{CoderUtil.java}}.

{code}
  static ByteBuffer convertInputBuffer(byte[] input, int offset, int len) {
{code}
Hmm.  This name seems a bit confusing.  What this function does has nothing to 
do with whether the buffer is for "input" versus "output"-- it's just copying 
data from an array to a {{DirectByteBuffer}}.  It's also not so much a 
"conversion" as a "copy".  Maybe something like {{cloneAsDirectByteBuffer}} 
would be a better name?

{code}
  static <T> int[] getErasedOrNotToReadIndexes(T[] inputs) {
{code}
Should be named {{getNullIndexes}}?

{code}
  static <T> T findFirstValidInput(T[] inputs) {
    if (inputs.length > 0 && inputs[0] != null) {
      return inputs[0];
    }

    for (T input : inputs) {
      if (input != null) {
        return input;
      }
    }
    ...
{code}
Why do we need the special case for the first element here?

{code}
  static <T> void makeValidIndexes(T[] inputs, int[] validIndexes) {
{code}
Should be named {{getNonNullIndexes}}?  Also, why does this one take an array 
passed in, whereas {{getNullIndexes}} returns an array?  I also don't see how 
the caller is supposed to know how many of the array slots were used by the 
function.  If the array starts as all zeros, that is identical to the function 
putting a zero in the first element of the array and then returning, right?  
Perhaps we could mandate that the caller set all the array slots to a negative 
value before calling the function, but that seems like an awkward calling 
convention-- and certainly one that should be documented via JavaDoc.

{code}
  @Override
  protected void doDecode(DecoderState decoderState, ByteBuffer[] inputs,
                          int[] erasedIndexes, ByteBuffer[] outputs) {
{code}
I'm not sure why we wouldn't just store {{DecoderState}} in the {{Decoder}}?  
These are stateful objects, I assume.

Continuing my comments from earlier:
* {{AbstractRawErasureCoder}} -- why do we need this base class?  Its function 
seems to be just storing configuration values.  Perhaps we'd be better off just 
having an {{ErasureEncodingConfiguration}} class which other objects can own 
(not inherit from).  I think of a configuration as something you *own*, not 
something you *are*, which is why I think composition would make more sense 
here.  Also, is it possible for this to be immutable?  Mutable configuration is 
a huge headache (another reason I dislike {{Configured.java}})
* {{AbstractRawErasure{En,De}coder}} -- why are these classes separate from 
{{RawErasureEncoder}} / {{RawErasureDecoder}}?  Do we expect that any encoders 
will implement {{RawErasureEncoder}}, but not extend 
{{AbstractRawErasureEncoder}}?  If not, it would be better just to have two 
base classes here rather than 2 classes and 2 interfaces.  Base classes are 
also easier to extend in the future than interfaces because you can add new 
methods without breaking backwards compatibility (as long as you have a default 
in the base).
* {{DummyRawDecoder}} -- {{NoOpRawDecoder}} would be a better name than 
"Dummy".  Is this intended to be used just in unit tests, or is it something 
the end-user should be able to configure?  If it is just unit tests, it should 
be under a {{test}} path, rather than a {{main}} path... i.e. 
{{hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/erasurecode/rawcoder/DummyRawDecoder.java}}

> Refactor raw erasure coders
> ---------------------------
>
>                 Key: HADOOP-13010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13010
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>             Fix For: 3.0.0
>
>         Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch
>
>
> This will refactor raw erasure coders according to some comments received so 
> far.
> * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to 
> rely class inheritance to reuse the codes, instead they can be moved to some 
> utility.
> * Suggested by [~jingzhao] somewhere quite some time ago, better to have a 
> state holder to keep some checking results for later reuse during an 
> encode/decode call.
> This would not get rid of some inheritance levels as doing so isn't clear yet 
> for the moment and also incurs big impact. I do wish the end result by this 
> refactoring will make all the levels more clear and easier to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (HADOOP-13010) Refactor raw erasure coders

Reply via email to