Gera Shegalov created HADOOP-12077:
--------------------------------------
Summary: Provide a multi-URI replication Inode for ViewFs
Key: HADOOP-12077
URL: https://issues.apache.org/jira/browse/HADOOP-12077
Project: Hadoop Common
Issue Type: New Feature
Components: fs
Reporter: Gera Shegalov
Assignee: Gera Shegalov
This JIRA is to provide simple "replication" capabilities for applications that
maintain logically equivalent paths in multiple locations for caching or
failover (e.g., S3 and HDFS). We noticed a simple common HDFS usage pattern in
our applications. They host their data on some logical cluster C. There are
corresponding HDFS clusters in multiple datacenters. When the application runs
in DC1, it prefers to read from C in DC1, and the application prefers to
fail over to C in DC2 if the application is migrated to DC2 or when C in DC1 is
unavailable. New application data versions are created periodically and
relatively infrequently.
In order to address many common scenarios in a general fashion, and to avoid
unnecessary code duplication, we implement this functionality in ViewFs (our
default FileSystem spanning all clusters in all datacenters) in a project
code-named Nfly (N as in N datacenters). Currently each ViewFs Inode points to
a single URI via ChRootedFileSystem. Consequently, we introduce a new type of
link that points to a list of URIs, each of which is wrapped in a
ChRootedFileSystem. A typical usage: /nfly/C/user->/DC1/C/user,/DC2/C/user,...
This collection of ChRootedFileSystem instances is fronted by the Nfly
filesystem object that is actually used for the mount point/Inode. The Nfly
filesystem backs a single logical path /nfly/C/user/<user>/path with multiple
physical paths.
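As a rough illustration of that mapping, the following sketch resolves one logical path under a mount point to one physical path per target. The class NflyLinkSketch and its resolve method are hypothetical names invented for this example; it does not use ViewFs or ChRootedFileSystem and only shows the path arithmetic.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not the ViewFs or Nfly API. */
final class NflyLinkSketch {
    private final String mountPoint;   // logical prefix, e.g. /nfly/C/user
    private final List<URI> targets;   // physical roots, e.g. /DC1/C/user and /DC2/C/user

    NflyLinkSketch(String mountPoint, List<URI> targets) {
        this.mountPoint = mountPoint;
        this.targets = new ArrayList<>(targets);
    }

    /** Returns one physical path per target for a logical path under the mount point. */
    List<URI> resolve(String logicalPath) {
        boolean underMount = logicalPath.equals(mountPoint)
                || logicalPath.startsWith(mountPoint + "/");
        if (!underMount) {
            throw new IllegalArgumentException(logicalPath + " is not under " + mountPoint);
        }
        // The part after the mount point is appended to every target root.
        String remainder = logicalPath.substring(mountPoint.length());
        List<URI> physical = new ArrayList<>();
        for (URI target : targets) {
            physical.add(URI.create(target.toString() + remainder));
        }
        return physical;
    }
}
```

With targets /DC1/C/user and /DC2/C/user, resolving /nfly/C/user/alice/data yields /DC1/C/user/alice/data and /DC2/C/user/alice/data.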
The Nfly filesystem supports setting minReplication. As long as the number of
URIs on which an update has succeeded is greater than or equal to
minReplication, exceptions are only logged but not thrown. Each update operation
is currently executed serially (client-bandwidth driven parallelism will be
added later).
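The minReplication rule can be sketched as follows. This is a minimal illustration under assumptions, not the Nfly code: the class MinReplicationSketch, the Update interface, and the use of plain strings for destinations are all hypothetical, and logging is reduced to System.err.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical illustration of the minReplication rule; not the Nfly implementation. */
final class MinReplicationSketch {
    /** Stand-in for one update (for example mkdir or delete) against one destination. */
    interface Update {
        void applyTo(String destination) throws IOException;
    }

    /** Applies the update to each destination serially and returns the success count. */
    static int updateAll(List<String> destinations, int minReplication, Update update)
            throws IOException {
        int succeeded = 0;
        IOException lastFailure = null;
        for (String destination : destinations) {
            try {
                update.applyTo(destination);
                succeeded++;
            } catch (IOException e) {
                // A failure on one destination is logged, not thrown.
                System.err.println("update failed on " + destination + ": " + e.getMessage());
                lastFailure = e;
            }
        }
        // Only when too few destinations succeeded does the caller see an exception.
        if (succeeded < minReplication) {
            throw new IOException("update succeeded on " + succeeded
                    + " destination(s), fewer than minReplication=" + minReplication, lastFailure);
        }
        return succeeded;
    }
}
```

With two destinations of which one is down, the call returns 1 when minReplication is 1 and throws when minReplication is 2.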
A file create/write:
# Creates a temporary invisible _nfly_tmp_file in each intended chrooted
filesystem.
# Returns an FSDataOutputStream that wraps the output streams returned by 1.
# All writes are forwarded to each output stream.
# On close of the stream created by 2, all n streams are closed, and the files
are renamed from _nfly_tmp_file to file. All files receive the same mtime
corresponding to the client system time as of the beginning of this step.
# If at least minReplication destinations have gone through steps 1-4 without
failures, the transaction is considered logically committed; otherwise a
best-effort attempt is made to clean up the temporary files.
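The steps above can be sketched with local directories standing in for the chrooted filesystems. This is an assumption-heavy simplification: NflyWriteSketch is a hypothetical name, java.nio.file replaces the Hadoop FileSystem API, and the wrapping output stream is collapsed into a single write call that takes the whole payload.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the write steps; local directories stand in for destinations. */
final class NflyWriteSketch {
    static final String TMP_PREFIX = "_nfly_tmp_";

    static int write(List<Path> destinations, String fileName, byte[] data, int minReplication)
            throws IOException {
        // Step 1: create the temporary file in every destination that accepts it.
        Map<Path, OutputStream> open = new LinkedHashMap<>();
        for (Path dir : destinations) {
            Path tmp = dir.resolve(TMP_PREFIX + fileName);
            try {
                open.put(tmp, Files.newOutputStream(tmp));
            } catch (IOException e) {
                System.err.println("create failed: " + tmp + ": " + e);
            }
        }
        // Steps 2-3: forward the write to each stream, then close it.
        List<Path> written = new ArrayList<>();
        for (Map.Entry<Path, OutputStream> entry : open.entrySet()) {
            try (OutputStream out = entry.getValue()) {
                out.write(data);
            } catch (IOException e) {
                System.err.println("write failed: " + entry.getKey() + ": " + e);
                continue;
            }
            written.add(entry.getKey());
        }
        // Step 4: one client-side timestamp for all copies, then rename into place.
        FileTime mtime = FileTime.fromMillis(System.currentTimeMillis());
        int committed = 0;
        for (Path tmp : written) {
            try {
                Files.setLastModifiedTime(tmp, mtime);
                Files.move(tmp, tmp.resolveSibling(fileName), StandardCopyOption.REPLACE_EXISTING);
                committed++;
            } catch (IOException e) {
                System.err.println("commit failed: " + tmp + ": " + e);
            }
        }
        // Step 5: below minReplication, clean up temporary files on a best-effort basis.
        if (committed < minReplication) {
            for (Path tmp : open.keySet()) {
                try {
                    Files.deleteIfExists(tmp);
                } catch (IOException ignored) {
                    // best effort only
                }
            }
            throw new IOException("committed to " + committed
                    + " destination(s), fewer than minReplication=" + minReplication);
        }
        return committed;
    }
}
```

The sketch does not roll back copies that were already renamed when the transaction fails; the description above only mentions cleaning up the temporary files.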
As for reads, we support a notion of locality similar to HDFS /DC/rack/node.
We sort Inode URIs using NetworkTopology by their authorities. These are
typically host names in simple HDFS URIs. If the authority is missing, as is
the case with the local file:///, the local host name
(InetAddress.getLocalHost()) is assumed. This makes sure that the local file
system is always the closest one to the reader in this approach. For our
Hadoop 2 hdfs URIs that are based on nameservice ids instead of hostnames, it
is very easy to adjust the topology script since our nameservice ids already
contain the datacenter. As for rack and node, we can simply output any string
such as /DC/rack-nsid/node-nsid, since we only care about datacenter-locality
for such filesystem clients.
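A minimal sketch of this ordering, assuming topology locations are /DC/rack/node strings and that distance is the number of hops to the closest common ancestor. NflyLocalitySketch is a hypothetical name, the topology function stands in for the topology script, and the local host name is passed in rather than read from InetAddress.getLocalHost() so the example stays deterministic. Hadoop's NetworkTopology is not used here.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of locality ordering; not Hadoop's NetworkTopology. */
final class NflyLocalitySketch {
    /** Hops from each location up to their closest common ancestor, summed. */
    static int distance(String a, String b) {
        String[] x = a.substring(1).split("/");
        String[] y = b.substring(1).split("/");
        int common = 0;
        while (common < x.length && common < y.length && x[common].equals(y[common])) {
            common++;
        }
        return (x.length - common) + (y.length - common);
    }

    /**
     * Sorts URIs from nearest to farthest relative to the reader.
     * The topology function maps a host name or nameservice id to /DC/rack/node.
     */
    static List<URI> sortByLocality(List<URI> uris, String localHost,
                                    Function<String, String> topology) {
        String here = topology.apply(localHost);
        List<URI> sorted = new ArrayList<>(uris);
        sorted.sort(Comparator.comparingInt((URI u) -> {
            // A URI without an authority, such as file:///, is treated as the local host.
            String authority = u.getAuthority() != null ? u.getAuthority() : localHost;
            return distance(here, topology.apply(authority));
        }));
        return sorted;
    }
}
```

For a reader in dc2, a file:/// URI sorts first (distance 0), then an hdfs URI whose nameservice id maps to dc2, then one that maps to dc1.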
There are 2 policies/additions to the read call path that make it more
expensive but improve user experience:
- readMostRecent - when this policy is enabled, Nfly first checks the mtime of
the path under all URIs and sorts the URIs from most recent to least recent.
Nfly then sorts the set of most recent URIs topologically in the same manner as
described above.
- repairOnRead - when readMostRecent is enabled, Nfly already has to RPC all
underlying destinations. With repairOnRead, the Nfly filesystem additionally
attempts to refresh destinations where the path is missing or stale, using the
nearest available most recent destination.
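The two policies can be sketched together as follows, assuming each destination has already been probed for its mtime and assigned a topological distance. The names NflyReadPolicySketch, Replica, and MISSING are hypothetical, and the actual copy performed by a repair is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of readMostRecent and repairOnRead; not the Nfly implementation. */
final class NflyReadPolicySketch {
    /** Marker mtime for a destination where the path does not exist. */
    static final long MISSING = -1L;

    static final class Replica {
        final String uri;
        final long mtime;     // MISSING if the path is absent on this destination
        final int distance;   // topological distance from the reader

        Replica(String uri, long mtime, int distance) {
            this.uri = uri;
            this.mtime = mtime;
            this.distance = distance;
        }
    }

    /** readMostRecent: newest mtime first, nearest first among equally recent copies. */
    static List<Replica> readOrder(List<Replica> replicas) {
        List<Replica> present = new ArrayList<>();
        for (Replica r : replicas) {
            if (r.mtime != MISSING) {
                present.add(r);
            }
        }
        present.sort(Comparator.comparingLong((Replica r) -> r.mtime).reversed()
                .thenComparingInt(r -> r.distance));
        return present;
    }

    /** repairOnRead: destinations that are missing the path or hold an older version. */
    static List<Replica> needingRepair(List<Replica> replicas) {
        long newest = MISSING;
        for (Replica r : replicas) {
            newest = Math.max(newest, r.mtime);
        }
        List<Replica> stale = new ArrayList<>();
        for (Replica r : replicas) {
            if (r.mtime < newest) {
                stale.add(r);
            }
        }
        return stale;
    }
}
```

In this sketch the first element of readOrder is the nearest most recent copy, so it would serve both as the read source and as the source for refreshing the destinations returned by needingRepair.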
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)