[
https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241679#comment-15241679
]
Colin Patrick McCabe edited comment on HADOOP-13010 at 4/14/16 6:29 PM:
------------------------------------------------------------------------
Thanks for this, [~drankye]. Looks good overall!
I like the idea of moving some of the utility stuff into {{CoderUtil.java}}.
{code}
static byte[] getEmptyChunk(int leastLength) {
if (emptyChunk.length >= leastLength) {
return emptyChunk; // In most time
}
synchronized (AbstractRawErasureCoder.class) {
emptyChunk = new byte[leastLength];
}
return emptyChunk;
}
{code}
This isn't safe for multiple threads, since we could be reading
{{CoderUtil#emptyChunk}} while it's in the middle of being written. You must
either make this {{volatile}} or hold the lock for this entire function.
It's unfortunate that we need a function like this-- I was hoping that there
would be some more efficient way of zeroing a ByteBuffer. One thing that's a
little concerning here is that a caller could modify the array returned by
{{getEmptyChunk}}, which would cause problems for other callers. To avoid
this, it's probably better to make this {{private}} to {{CoderUtil.java}}.
{code}
static ByteBuffer convertInputBuffer(byte[] input, int offset, int len) {
{code}
Hmm. This name seems a bit confusing. What this function does has nothing to
do with whether the buffer is for "input" versus "output"-- it's just copying
data from an array to a {{DirectByteBuffer}}. It's also not so much a
"conversion" as a "copy". Maybe something like {{cloneAsDirectByteBuffer}}
would be a better name?
{code}
static <T> int[] getErasedOrNotToReadIndexes(T[] inputs) {
{code}
Should be named {{getNullIndexes}}?
{code}
static <T> T findFirstValidInput(T[] inputs) {
if (inputs.length > 0 && inputs[0] != null) {
return inputs[0];
}
for (T input : inputs) {
if (input != null) {
return input;
}
}
...
{code}
Why do we need the special case for the first element here?
{code}
static <T> void makeValidIndexes(T[] inputs, int[] validIndexes) {
{code}
Should be named {{getNonNullIndexes}}? Also, why does this one take an array
passed in, whereas {{getNullIndexes}} returns an array? I also don't see how
the caller is supposed to know how many of the array slots were used by the
function. If the array starts as all zeros, that is identical to the function
putting a zero in the first element of the array and then returning, right?
Perhaps we could mandate that the caller set all the array slots to a negative
value before calling the function, but that seems like an awkward calling
convention-- and certainly one that should be documented via JavaDoc.
{code}
@Override
protected void doDecode(DecoderState decoderState, ByteBuffer[] inputs,
int[] erasedIndexes, ByteBuffer[] outputs) {
{code}
I'm not sure why we wouldn't just store {{DecoderState}} in the {{Decoder}}?
These are stateful objects, I assume.
Continuing my comments from earlier:
* {{AbstractRawErasureCoder}} -- why do we need this base class? Its function
seems to be just storing configuration values. Perhaps we'd be better off just
having an {{ErasureEncodingConfiguration}} class which other objects can own
(not inherit from). I think of a configuration as something you *own*, not
something you *are*, which is why I think composition would make more sense
here. Also, is it possible for this to be immutable? Mutable configuration is
a huge headache (another reason I dislike {{Configured.java}})
* {{AbstractRawErasureEncoder}} /{{AbstractRawErasureDecoder}} -- why are these
classes separate from {{RawErasureEncoder}} / {{RawErasureDecoder}}? Do we
expect that any encoders will implement {{RawErasureEncoder}}, but not extend
{{AbstractRawErasureEncoder}}? If not, it would be better just to have two
base classes here rather than 2 classes and 2 interfaces. Base classes are
also easier to extend in the future than interfaces because you can add new
methods without breaking backwards compatibility (as long as you have a default
in the base).
* {{DummyRawDecoder}} -- {{NoOpRawDecoder}} would be a better name than
"Dummy". Is this intended to be used just in unit tests, or is it something
the end-user should be able to configure? If it is just unit tests, it should
be under a {{test}} path, rather than a {{main}} path... i.e.
{{hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/erasurecode/rawcoder/DummyRawDecoder.java}}
was (Author: cmccabe):
Thanks for this, [~drankye]. Looks good overall!
I like the idea of moving some of the utility stuff into {{CoderUtil.java}}.
{code}
static byte[] getEmptyChunk(int leastLength) {
if (emptyChunk.length >= leastLength) {
return emptyChunk; // In most time
}
synchronized (AbstractRawErasureCoder.class) {
emptyChunk = new byte[leastLength];
}
return emptyChunk;
}
{code}
This isn't safe for multiple threads, since we could be reading
{{CoderUtil#emptyChunk}} while it's in the middle of being written. You must
either make this {{volatile}} or hold the lock for this entire function.
It's unfortunate that we need a function like this-- I was hoping that there
would be some more efficient way of zeroing a ByteBuffer. One thing that's a
little concerning here is that a caller could modify the array returned by
{{getEmptyChunk}}, which would cause problems for other callers. To avoid
this, it's probably better to make this {{private}} to {{CoderUtil.java}}.
{code}
static ByteBuffer convertInputBuffer(byte[] input, int offset, int len) {
{code}
Hmm. This name seems a bit confusing. What this function does has nothing to
do with whether the buffer is for "input" versus "output"-- it's just copying
data from an array to a {{DirectByteBuffer}}. It's also not so much a
"conversion" as a "copy". Maybe something like {{cloneAsDirectByteBuffer}}
would be a better name?
{code}
static <T> int[] getErasedOrNotToReadIndexes(T[] inputs) {
{code}
Should be named {{getNullIndexes}}?
{code}
static <T> T findFirstValidInput(T[] inputs) {
if (inputs.length > 0 && inputs[0] != null) {
return inputs[0];
}
for (T input : inputs) {
if (input != null) {
return input;
}
}
...
{code}
Why do we need the special case for the first element here?
{code}
static <T> void makeValidIndexes(T[] inputs, int[] validIndexes) {
{code}
Should be named {{getNonNullIndexes}}? Also, why does this one take an array
passed in, whereas {{getNullIndexes}} returns an array? I also don't see how
the caller is supposed to know how many of the array slots were used by the
function. If the array starts as all zeros, that is identical to the function
putting a zero in the first element of the array and then returning, right?
Perhaps we could mandate that the caller set all the array slots to a negative
value before calling the function, but that seems like an awkward calling
convention-- and certainly one that should be documented via JavaDoc.
{code}
@Override
protected void doDecode(DecoderState decoderState, ByteBuffer[] inputs,
int[] erasedIndexes, ByteBuffer[] outputs) {
{code}
I'm not sure why we wouldn't just store {{DecoderState}} in the {{Decoder}}?
These are stateful objects, I assume.
Continuing my comments from earlier:
* {{AbstractRawErasureCoder}} -- why do we need this base class? Its function
seems to be just storing configuration values. Perhaps we'd be better off just
having an {{ErasureEncodingConfiguration}} class which other objects can own
(not inherit from). I think of a configuration as something you *own*, not
something you *are*, which is why I think composition would make more sense
here. Also, is it possible for this to be immutable? Mutable configuration is
a huge headache (another reason I dislike {{Configured.java}})
* {{AbstractRawErasure{En,De}coder}} -- why are these classes separate from
{{RawErasureEncoder}} / {{RawErasureDecoder}}? Do we expect that any encoders
will implement {{RawErasureEncoder}}, but not extend
{{AbstractRawErasureEncoder}}? If not, it would be better just to have two
base classes here rather than 2 classes and 2 interfaces. Base classes are
also easier to extend in the future than interfaces because you can add new
methods without breaking backwards compatibility (as long as you have a default
in the base).
* {{DummyRawDecoder}} -- {{NoOpRawDecoder}} would be a better name than
"Dummy". Is this intended to be used just in unit tests, or is it something
the end-user should be able to configure? If it is just unit tests, it should
be under a {{test}} path, rather than a {{main}} path... i.e.
{{hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/erasurecode/rawcoder/DummyRawDecoder.java}}
> Refactor raw erasure coders
> ---------------------------
>
> Key: HADOOP-13010
> URL: https://issues.apache.org/jira/browse/HADOOP-13010
> Project: Hadoop Common
> Issue Type: Sub-task
> Reporter: Kai Zheng
> Assignee: Kai Zheng
> Fix For: 3.0.0
>
> Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch
>
>
> This will refactor raw erasure coders according to some comments received so
> far.
> * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to
> rely class inheritance to reuse the codes, instead they can be moved to some
> utility.
> * Suggested by [~jingzhao] somewhere quite some time ago, better to have a
> state holder to keep some checking results for later reuse during an
> encode/decode call.
> This would not get rid of some inheritance levels as doing so isn't clear yet
> for the moment and also incurs big impact. I do wish the end result by this
> refactoring will make all the levels more clear and easier to follow.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)