[ 
https://issues.apache.org/jira/browse/HADOOP-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088931#comment-16088931
 ] 

Steve Loughran commented on HADOOP-14444:
-----------------------------------------

I'm watching this, but I'm going offline until August and won't be going near 
JIRA. If others watching can download it and play with it, that'd be good.

Things to look for:
* All the obscure FTP servers, esp. the non-ASCII ones (AS/400, etc.). Do they 
work? If not, does it matter?
* Talking to Windows FTP servers: even if Active Directory auth doesn't work 
(unlikely), does it fail meaningfully?
* Resilience over long-haul links and large file operations
* Failure handling & reporting
* Docs & improvements there

Also worth considering: migration strategy.
# Given the s3a experience, it's good to ship new and old side by side. But we 
may want to declare this one the default, allowing people to switch back if 
they do have problems.
# How will things break if the ftp dependencies are pulled from hadoop-common?



> New implementation of ftp and sftp filesystems
> ----------------------------------------------
>
>                 Key: HADOOP-14444
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14444
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs
>    Affects Versions: 2.8.0
>            Reporter: Lukas Waldmann
>            Assignee: Lukas Waldmann
>         Attachments: HADOOP-14444.2.patch, HADOOP-14444.3.patch, 
> HADOOP-14444.4.patch, HADOOP-14444.5.patch, HADOOP-14444.6.patch, 
> HADOOP-14444.patch
>
>
> The current implementations of the FTP and SFTP filesystems have severe 
> limitations and performance issues when dealing with a high number of files. 
> My patch solves those issues and integrates both filesystems in such a way 
> that most of the core functionality is common to both, which simplifies 
> maintenance.
> The core features:
> * Support for HTTP/SOCKS proxies
> * Support for passive FTP
> * Support for connection pooling - a new connection is not created for every 
> single command but is reused from the pool.
> For a huge number of files this shows an order-of-magnitude performance 
> improvement over non-pooled connections.
> * Caching of directory trees. With FTP you always need to list the whole 
> directory whenever you ask for information about a particular file.
> Again, for a huge number of files this shows an order-of-magnitude 
> performance improvement over non-cached connections.
> * Support for keep-alive (NOOP) messages to avoid connection drops
> * Support for Unix-style or regexp wildcard globs - useful for listing 
> particular files across a whole directory tree
> * Support for re-establishing broken FTP data transfers - this can happen 
> surprisingly often
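
The directory-tree caching point in the description above can be illustrated with a minimal sketch. This is not code from the HADOOP-14444 patch: the class name DirectoryListingCache, the sizeOf and invalidate methods, and the lister function (a stand-in for a real FTP LIST call) are all hypothetical names chosen for illustration. The idea is only that a whole-directory listing is fetched once and then reused for later per-file lookups in the same directory.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch; not the class used in the HADOOP-14444 patch. */
class DirectoryListingCache {
    // Assumed stand-in for an FTP LIST call:
    // directory path -> (file name -> size in bytes).
    private final Function<String, Map<String, Long>> lister;

    // One cached listing per directory that has been asked about.
    private final Map<String, Map<String, Long>> listings = new ConcurrentHashMap<>();

    DirectoryListingCache(Function<String, Map<String, Long>> lister) {
        this.lister = lister;
    }

    /** Returns the size of the file at path, listing its parent directory only on a cache miss. */
    Optional<Long> sizeOf(String path) {
        int slash = path.lastIndexOf('/');
        String dir = slash <= 0 ? "/" : path.substring(0, slash);
        String name = path.substring(slash + 1);
        // The lister runs only if this directory has no cached listing yet.
        Map<String, Long> listing = listings.computeIfAbsent(dir, lister);
        return Optional.ofNullable(listing.get(name));
    }

    /** Drops a cached listing, for example after a write or delete in that directory. */
    void invalidate(String dir) {
        listings.remove(dir);
    }
}
```

With a cache like this, asking about many files in one directory costs a single listing rather than one listing per file. A real implementation would also need an expiry or invalidation policy, since the remote directory can change behind the cache.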



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
