[
https://issues.apache.org/jira/browse/HADOOP-18215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688800#comment-17688800
]
ASF GitHub Bot commented on HADOOP-18215:
-----------------------------------------
bbeaudreault commented on PR #4215:
URL: https://github.com/apache/hadoop/pull/4215#issuecomment-1430703431
Just want to clarify the use-case here (as a reminder and for the new
reviewers):
- When you write a SequenceFile, the key class and value class are encoded
in the format.
- When you go to read that file later, the SequenceFile.Reader gets those
key/value class names from the headers and tries to load the class so it can
parse the keys/values. This is where WritableName first comes in, with the
default being `Class.forName(keyClass)`.
- However, some SequenceFiles may be very old and the classes may have been
renamed. In this case, the above would throw ClassNotFoundException.
- This is why `WritableName.addName` exists, which allows you to specify
aliases pointing those old class names at whatever the new name is. When
`WritableName.getClass` is called it will check to see if an alias was
registered by calling addName prior to opening the SequenceFile. If so, it
returns that class.
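A minimal sketch of the lookup order described in the list above, using only the JDK. It is a simplified stand-in, not Hadoop's code: the class `AliasLookupSketch`, its methods `addAlias` and `resolve`, and the name `com.example.old.RenamedKey` are all hypothetical and only mirror the roles of `WritableName.addName` and `WritableName.getClass`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the alias lookup; hypothetical, not Hadoop's implementation.
final class AliasLookupSketch {
    // Old class name (as stored in a file header) -> class to use today.
    private static final Map<String, Class<?>> ALIASES = new ConcurrentHashMap<>();

    // Plays the role of WritableName.addName: register an alias before reading.
    static void addAlias(Class<?> current, String oldName) {
        ALIASES.put(oldName, current);
    }

    // Plays the role of WritableName.getClass: alias first, then Class.forName.
    static Class<?> resolve(String name) throws ClassNotFoundException {
        Class<?> aliased = ALIASES.get(name);
        if (aliased != null) {
            return aliased;
        }
        return Class.forName(name);
    }

    public static void main(String[] args) throws Exception {
        String oldName = "com.example.old.RenamedKey"; // hypothetical header value

        try {
            resolve(oldName);
        } catch (ClassNotFoundException e) {
            System.out.println("without an alias the lookup fails");
        }

        // StringBuilder stands in for "whatever the class is called now".
        addAlias(StringBuilder.class, oldName);
        System.out.println(resolve(oldName).getName());
    }
}
```

The point of the sketch is the ordering: the alias has to be registered before the file is opened, because the reader resolves the header's class names as soon as it starts parsing.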
This all worked when all keys/values extended Writable, but Hadoop also supports
a Serialization framework. You can specify `io.serializations` to register
serializations, and the SerializationFactory will try to find a serializer for
the key or value class. Serializations have a `boolean accept(Class)` method,
and one of the registered serializations needs to return true for that class.
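To make the `accept` check concrete, here is a small JDK-only sketch of picking the first registered serialization that accepts a class. The interface `SerializationLike`, both implementations, and `find` are hypothetical and model only the `accept` step. They are not the Hadoop `Serialization` or `SerializationFactory` types.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical model of "one registered serialization must accept the class".
final class SerializationLookupSketch {
    // Trimmed-down stand-in: only the accept check is modeled.
    interface SerializationLike {
        boolean accept(Class<?> c);
    }

    static final class NumberSerialization implements SerializationLike {
        @Override
        public boolean accept(Class<?> c) {
            return Number.class.isAssignableFrom(c);
        }
    }

    static final class CharSequenceSerialization implements SerializationLike {
        @Override
        public boolean accept(Class<?> c) {
            return CharSequence.class.isAssignableFrom(c);
        }
    }

    // Return the first registered serialization that accepts the class, if any.
    static Optional<SerializationLike> find(List<SerializationLike> registered, Class<?> c) {
        for (SerializationLike s : registered) {
            if (s.accept(c)) {
                return Optional.of(s);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<SerializationLike> registered =
            Arrays.asList(new NumberSerialization(), new CharSequenceSerialization());
        System.out.println(find(registered, Integer.class).isPresent());
        System.out.println(find(registered, Object.class).isPresent());
    }
}
```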
So when the same "old sequence file contains a key or value class that has
been renamed" problem happens, if you are using Serialization you are out of
luck. By default you'd get a ClassNotFoundException, and if you tried doing
WritableName.addName, you'd get ClassCastException.
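The ClassCastException comes from how `Class.asSubclass` behaves when the aliased class does not implement the expected interface. The sketch below reproduces that with stand-in types: `WritableLike`, `PlainRecord`, `OldStyleKey`, `strict`, and `relaxed` are hypothetical names, not Hadoop classes or methods.

```java
// Stand-in types only; this illustrates Class.asSubclass, not Hadoop's code.
final class AsSubclassSketch {
    interface WritableLike {}                              // plays the role of Writable
    static final class OldStyleKey implements WritableLike {}
    static final class PlainRecord {}                      // meant for a serializer instead

    // Restrictive lookup: the aliased class must implement WritableLike.
    static Class<? extends WritableLike> strict(Class<?> aliased) {
        return aliased.asSubclass(WritableLike.class);
    }

    // Relaxed lookup: return whatever class was registered.
    static Class<?> relaxed(Class<?> aliased) {
        return aliased;
    }

    public static void main(String[] args) {
        System.out.println(strict(OldStyleKey.class).getSimpleName());
        System.out.println(relaxed(PlainRecord.class).getSimpleName());
        try {
            strict(PlainRecord.class);
        } catch (ClassCastException e) {
            System.out.println("strict lookup rejects PlainRecord");
        }
    }
}
```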
----
The simplest fix for that seemed to be in WritableName which appeared to be
IA.Private and have no real usages in the repo outside of SequenceFile. A one
line change there was attractive. The risk here seems pretty low, at least for
how SequenceFile uses this class.
If we have concerns here, there are other, more involved solutions we could
discuss. For example, we could add something in SerializationFactory to
register aliases. That would take more work, though, because it would require a
slight refactor in SequenceFile, and we'd have to make sure the new API worked
for any other usages of SerializationFactory.
That's why I chose the simple 1 liner approach, since it solves the problem
with simplicity and minimal external impact.
> Enhance WritableName to be able to return aliases for classes that use
> serializers
> ----------------------------------------------------------------------------------
>
> Key: HADOOP-18215
> URL: https://issues.apache.org/jira/browse/HADOOP-18215
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Bryan Beaudreault
> Assignee: Bryan Beaudreault
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> WritableName allows users to shim in aliases for writables, in the case where a
> SequenceFile was written with a Writable class that has since been renamed or
> moved to another package. However, this requires that the aliased class
> extend Writable.
> Separately it's possible to configure jobs with keys and values which don't
> actually extend Writable. Instead they are meant to be
> serialized/deserialized using the serialization classes defined in
> {{io.serializations}} config.
> Unfortunately, the current implementation does not support these key/value
> classes. All we need to do to support this is remove the
> {{.asSubclass(Writable.class)}} as is already the case for the default.