[
https://issues.apache.org/jira/browse/HADOOP-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Virajith Jalaparti updated HADOOP-17072:
----------------------------------------
Description:
In a federated setting (HDFS federation, federation across multiple buckets on
S3, multiple containers across Azure storage), certain system tools/pipelines
require the ability to map paths to the clusters/accounts.
Consider the example of GDPR compliance/retention jobs that need to go over
various datasets ingested over a period of T days and remove/quarantine
datasets that are not properly annotated or have reached their retention period.
Such jobs can rely on renames to a global trash/quarantine directory to
accomplish their task. However, in a federated setting, efficient, atomic
renames (as those within a single HDFS cluster) are not supported across the
different clusters/shards in federation. As a result, such jobs will need to
leverage a trash/quarantine directory per cluster/shard. Further, they would
need to map from a particular path to the cluster/shard that contains this path.
To address such cases, this JIRA proposes to add two new methods to
{{FileSystem}}: {{getClusterRoot}} and {{getClusterRoots()}}.
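
As a rough illustration of the use case above, the sketch below shows how a
retention job might pick a per-cluster quarantine directory once it can map a
path to its cluster root. It does not use the Hadoop API: the description does
not give the signatures of the proposed methods, so the class
ClusterRootSketch, its mount table, the longest-prefix matching rule, the
".quarantine" directory name, and the "hdfs://ns1"/"hdfs://ns2" roots are all
assumptions made for this example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a federated mount table; NOT the Hadoop API. */
public class ClusterRootSketch {
    // Mount point (path prefix) -> root URI of the cluster/shard backing it.
    private final Map<String, URI> mounts = new LinkedHashMap<>();

    public void addMount(String prefix, URI clusterRoot) {
        mounts.put(prefix, clusterRoot);
    }

    /** Assumed behaviour of getClusterRoot: the longest matching mount prefix wins. */
    public URI getClusterRoot(String path) {
        String best = null;
        for (String prefix : mounts.keySet()) {
            // Match whole path components only, so "/database" does not match "/data".
            String dirPrefix = prefix.endsWith("/") ? prefix : prefix + "/";
            boolean matches = path.equals(prefix) || path.startsWith(dirPrefix);
            if (matches && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No mount point covers " + path);
        }
        return mounts.get(best);
    }

    /** Assumed behaviour of getClusterRoots: every distinct cluster root, in mount order. */
    public List<URI> getClusterRoots() {
        List<URI> roots = new ArrayList<>();
        for (URI root : mounts.values()) {
            if (!roots.contains(root)) {
                roots.add(root);
            }
        }
        return roots;
    }

    /** Quarantine location on the same cluster as the path, so the rename never crosses clusters. */
    public String quarantineTarget(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return getClusterRoot(path) + "/.quarantine/" + name;
    }

    public static void main(String[] args) {
        ClusterRootSketch fs = new ClusterRootSketch();
        fs.addMount("/data", URI.create("hdfs://ns1"));
        fs.addMount("/data/logs", URI.create("hdfs://ns2"));

        System.out.println(fs.getClusterRoots());
        System.out.println(fs.quarantineTarget("/data/logs/events"));
        System.out.println(fs.quarantineTarget("/data/tables/users"));
    }
}
```

With a mapping like this, a job can keep one quarantine directory per entry
returned by getClusterRoots() and rename each dataset only within the cluster
that getClusterRoot reports for it.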
was:
In a federated setting (HDFS federation, federation across multiple buckets on
S3, multiple containers across Azure storage), certain system tools/pipelines
require the ability to map paths to the clusters/accounts.
Consider GDPR compliance/retention jobs need to go over the datasets ingested
over a period of T days and remove/quarantine datasets that are not properly
annotated/have reached their retention period. Such jobs can rely on renames to
a global trash/quarantine directory to accomplish their task. However, in a
federated setting, efficient, atomic renames (as those within a single HDFS
cluster) are not supported across the different clusters/shards in federation.
As a result, such jobs will need to get the clusters to which different paths
map to.
To address such cases, this JIRA proposes to get add two new methods to
{{FileSystem}}: {{getClusterRoot}} and {{getClusterRoots()}}.
> Add getClusterRoot and getClusterRoots methods to FileSystem and
> ViewFilesystem
> -------------------------------------------------------------------------------
>
> Key: HADOOP-17072
> URL: https://issues.apache.org/jira/browse/HADOOP-17072
> Project: Hadoop Common
> Issue Type: Task
> Components: fs, viewfs
> Reporter: Virajith Jalaparti
> Assignee: Virajith Jalaparti
> Priority: Major
>
> In a federated setting (HDFS federation, federation across multiple buckets
> on S3, multiple containers across Azure storage), certain system
> tools/pipelines require the ability to map paths to the clusters/accounts.
> Consider the example of GDPR compliance/retention jobs that need to go over
> various datasets ingested over a period of T days and remove/quarantine
> datasets that are not properly annotated or have reached their retention
> period.
> Such jobs can rely on renames to a global trash/quarantine directory to
> accomplish their task. However, in a federated setting, efficient, atomic
> renames (as those within a single HDFS cluster) are not supported across the
> different clusters/shards in federation. As a result, such jobs will need to
> leverage a trash/quarantine directory per cluster/shard. Further, they would
> need to map from a particular path to the cluster/shard that contains this
> path.
> To address such cases, this JIRA proposes to add two new methods to
> {{FileSystem}}: {{getClusterRoot}} and {{getClusterRoots()}}.