[PROPOSAL] Sqoop Project

arv...@cloudera.com Fri, 27 May 2011 11:40:55 -0700

Greetings All,

We would like to propose the Sqoop project for inclusion in the ASF Incubator
as a new podling. Sqoop is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational
databases. The complete proposal can be found at:


http://wiki.apache.org/incubator/SqoopProposal

The initial contents of this proposal are also pasted below for convenience.

Thanks and Regards,
Arvind Prabhakar

= Sqoop - A Data Transfer Tool for Hadoop =

== Abstract ==

Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and structured datastores such as relational databases. You
can use Sqoop to import data from external structured datastores into the
Hadoop Distributed File System or related systems like Hive and HBase. Conversely,
Sqoop can be used to extract data from Hadoop and export it to external
structured datastores such as relational databases and enterprise data
warehouses.

== Proposal ==

Hadoop and related systems operate on large volumes of data. Typically this
data originates from outside of Hadoop infrastructure and must be
provisioned for consumption by Hadoop and related systems for analysis and
processing. Sqoop allows fast provisioning of data into Hadoop and related
systems by providing a bulk import and export mechanism that enables
consumers to effectively use Hadoop for data analysis and processing.

== Background ==

Sqoop was initially developed by Cloudera to enable the import and export of
data between various databases and the Hadoop Distributed File System (HDFS).
It was provided as a patch to the Hadoop project via the issue
[[https://issues.apache.org/jira/browse/HADOOP-5815|HADOOP-5815]] and was
maintained as a contrib module to Hadoop from May 2009 to April 2010. In
April 2010, Sqoop was removed from Hadoop contrib via
[[https://issues.apache.org/jira/browse/MAPREDUCE-1644|MAPREDUCE-1644]] and
was made available by Cloudera on [[http://github.com/cloudera/sqoop|GitHub]].


Since then, Sqoop has been maintained by Cloudera as an open source project
on GitHub. All code available in Sqoop is open source and made publicly
available under the Apache 2.0 license. During this time Sqoop has been
formally released three times, as versions 1.0, 1.1 and 1.2.

== Rationale ==

Hadoop is often used to process data that originates in, or is later served by,
structured data stores such as relational databases, spreadsheets or
enterprise data warehouses. Unfortunately, current methods of transferring
data are inefficient and ad hoc, often consisting of manual steps specific
to the external system. These steps are necessary to help provision this
data for consumption by Map-Reduce jobs, or by systems that build on top of
Hadoop such as Hive and Pig. The transfer of this data can take a substantial
amount of time depending upon its size. A transfer approach that is optimal
for one particular datastore will typically not work as well with another
datastore due to inherent architectural differences between datastore
implementations. Sqoop addresses this problem by connecting Hadoop to
external systems via pluggable connectors. Specialized connectors are
developed to optimize data transfer performance between Hadoop and target
systems.
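
The pluggable connector idea can be illustrated with a minimal Python sketch.
This is not Sqoop's actual connector API: the class names, the registry, the
strategy strings and the rule of selecting a connector from the scheme of a
JDBC-style connect string are all assumptions made for illustration. The
sketch only shows a specialized connector being chosen when one is registered
and a generic one being used otherwise.

```python
class GenericJdbcConnector:
    """Hypothetical fallback connector that works with any JDBC-style source."""
    name = "generic-jdbc"

    def import_strategy(self) -> str:
        # Illustrative description only, not a real Sqoop setting.
        return "parallel SELECT queries"


class MySqlConnector(GenericJdbcConnector):
    """Hypothetical specialized connector tuned for a single datastore."""
    name = "mysql"

    def import_strategy(self) -> str:
        # Illustrative description only, not a real Sqoop setting.
        return "database-native bulk dump"


# Registry mapping a connect-string scheme to a connector class.
_REGISTRY = {}


def register(scheme: str, connector_cls: type) -> None:
    """Plug a specialized connector in for the given scheme."""
    _REGISTRY[scheme] = connector_cls


def connector_for(connect_url: str) -> GenericJdbcConnector:
    """Return a specialized connector if one is registered, else the generic one."""
    parts = connect_url.split(":", 2)
    scheme = parts[1] if len(parts) == 3 and parts[0] == "jdbc" else ""
    return _REGISTRY.get(scheme, GenericJdbcConnector)()


register("mysql", MySqlConnector)
```

With this sketch, `connector_for("jdbc:mysql://db.example.com/sales")` returns
the specialized connector, while a connect string with an unregistered scheme
falls back to the generic one. Supporting a new system means registering one
more class, without changing the lookup code.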

Analyzed and processed data from Hadoop and related systems may also need
to be provisioned outside of Hadoop for consumption by business
applications. Sqoop allows the export of data from Hadoop to external
systems to facilitate its use in other systems. This too, like the import
scenario, is implemented via specialized connectors that are built for the
purposes of optimal integration between Hadoop and external systems.

Connectors can be built for systems that Sqoop does not yet integrate with
and can then be easily incorporated into Sqoop. Connectors allow Sqoop to
interface with external systems of different types, ensuring that newer
systems can integrate with Hadoop with relative ease and in a consistent
manner.

Besides allowing integration with other external systems, Sqoop provides
tight integration with systems that build on top of Hadoop, such as Hive and
HBase, thus providing data integration between Hadoop-based systems and
external systems in a single step.

== Initial Goals ==

Sqoop is currently in its first major release with a considerable number of
enhancement requests, tasks, and issues logged towards its future
development. The initial goal of this project will be to address the most
requested features and bug fixes in its next dot release. The key
features of interest are the following:
 * Support for bulk import into Apache HBase.
 * Allow the user to supply a password in a permission-protected file.
 * Support for a pluggable query to help Sqoop identify the metadata
associated with the source or target table definitions.
 * Allow the user to specify custom split semantics for efficient
parallelization of import jobs.
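
To make the last item concrete, the following Python sketch shows one simple
split strategy: dividing the range of a numeric key into contiguous sub-ranges
so that each can be imported by a separate parallel task. It is an assumption
for illustration only, not Sqoop's implementation; the function name and
parameters are hypothetical, and a custom split could just as well be driven
by dates, hashes or datastore-specific partitions.

```python
from typing import List, Tuple


def split_ranges(lo: int, hi: int, num_splits: int) -> List[Tuple[int, int]]:
    """Divide the inclusive key range [lo, hi] into contiguous inclusive sub-ranges.

    At most num_splits ranges are returned, and their sizes differ by at most one.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if hi < lo:
        return []  # empty key range, nothing to import

    total = hi - lo + 1
    num_splits = min(num_splits, total)  # never create empty splits
    base, extra = divmod(total, num_splits)

    ranges = []
    start = lo
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first splits
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

For example, `split_ranges(1, 10, 3)` yields `[(1, 4), (5, 7), (8, 10)]`. Even
ranges like these only balance the work when key values are roughly uniformly
distributed, which is one reason user-specified split semantics are useful.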

= Current Status =

== Meritocracy ==

Sqoop has been an open source project since its start. It was initially
developed by Aaron Kimball in May 2009 along with the development team at
Cloudera and supplied as a patch to the Hadoop project. Later it was moved to
GitHub as a Cloudera open-source project, where the Cloudera engineering team
has since maintained it, with Arvind Prabhakar and Ahmed Radwan dedicated to
its improvement. Developers external to Cloudera have provided feedback,
suggested features and fixes, and implemented extensions of Sqoop since its
inception. Contributors to Sqoop include developers from different
companies and different parts of the world.

== Community ==

Sqoop is currently used by a number of organizations all over the world.
It has an active and growing user community that participates in the
[[https://groups.google.com/a/cloudera.org/group/sqoop-user/topics|user]]
and
[[https://groups.google.com/a/cloudera.org/group/sqoop-dev/topics|developer]]
mailing lists.

== Core Developers ==

The core developers for the Sqoop project are:
 * Aaron Kimball: Aaron designed and implemented much of the original code.
 * Arvind Prabhakar: Has been working on Sqoop features and bug fixes.
 * Ahmed Radwan: Has been working on Sqoop features and bug fixes.
 * Jonathan Hsieh: Has started working on Sqoop features and bug fixes.
 * Other contributors to the project include: Angus He, Brian Muller, Eli
Collins, Guy Le Mar, James Grant, Konstantin Boudnik, Lars Francke, Michael
Hausler, Michael Katzenellenbogen, Pter Happ and Scott Foster.

All committers to the Sqoop project have contributed to Hadoop or related
Apache projects and are very familiar with Apache principles and philosophy
for community-driven software development.

== Alignment ==

Sqoop complements Hadoop Map-Reduce, Pig, Hive, and HBase by providing a
robust mechanism to allow data integration from external systems for effective
data analysis. It currently integrates with Hive and HBase, and work is being
done to integrate it with Pig.

= Known Risks =

== Orphaned Products ==

Sqoop is already deployed in production at multiple companies, and they are
actively participating in feature requests and user-led discussions. Sqoop
is gaining traction with developers, and thus the risk of it being orphaned
is minimal.

== Inexperience with Open Source ==

All code developed for Sqoop has been open source from the start. The
initial part of Sqoop development was done within the Hadoop project as a
contrib module. Since then it has been maintained as an Apache 2.0 licensed
open-source project on GitHub by Cloudera.

All committers of the Sqoop project are intimately familiar with the Apache
model for open-source development and are experienced in working with new
contributors. Aaron Kimball, the creator of the project and one of the
committers, is also a committer on Apache MapReduce.

== Homogeneous Developers ==

The initial set of committers is from a small set of organizations. However,
we expect that once approved for incubation, the project will attract new
contributors from diverse organizations and will thus grow organically. The
participation of developers from several different organizations in the
mailing lists strongly supports this expectation.

== Reliance on Salaried Developers ==

It is expected that Sqoop will be developed on salaried and volunteer time,
although all of the initial developers will work on it mainly on salaried
time.

== Relationships with Other Apache Products ==

Sqoop depends upon other Apache projects: Hadoop, Hive, HBase, Log4J and
multiple Apache Commons components, as well as build systems like Ant and Maven.

== An Excessive Fascination with the Apache Brand ==

The reasons for joining Apache are to increase the synergy with other Apache
Hadoop-related projects and to foster a healthy community of contributors
and consumers around the project. This is facilitated by the ASF, and that is
the primary reason we would like Sqoop to become an Apache project.

= Documentation =

 * All Sqoop documentation is maintained within Sqoop sources and can be
built directly.
 * Sqoop docs: http://archive.cloudera.com/cdh/3/sqoop/
 * Sqoop wiki at GitHub: https://github.com/cloudera/sqoop/wiki
 * Sqoop JIRA at Cloudera: https://issues.cloudera.org/browse/sqoop

= Initial Source =

 * https://github.com/cloudera/sqoop/tree/

== Source and Intellectual Property Submission Plan ==

 * The initial source is already Apache 2.0 licensed.

== External Dependencies ==

The required external dependencies are all under the Apache License or
compatible licenses. Components with non-Apache licenses are enumerated below:

 * HSQLDB: HSQLDB License - a BSD-based license.

Non-Apache build tools that are used by Sqoop are as follows:

 * AsciiDoc: GNU GPLv2
 * Checkstyle: GNU LGPLv3
 * FindBugs: GNU LGPL
 * Cobertura: GNU GPLv2

== Cryptography ==

Sqoop does not depend upon any cryptography tools or libraries.

= Required Resources =

== Mailing lists ==

 * sqoop-private (with moderated subscriptions)
 * sqoop-dev
 * sqoop-commits
 * sqoop-user

== Subversion Directory ==

https://svn.apache.org/repos/asf/incubator/sqoop

== Issue Tracking ==

JIRA Sqoop (SQOOP)

== Other Resources ==

The existing code already has unit and integration tests, so we would like a
Hudson instance to run them whenever a new patch is submitted. This can be
added after project creation.

= Initial Committers =

 * Arvind Prabhakar (arvind at cloudera dot com)
 * Ahmed Radwan (a dot aboelela at gmail dot com)
 * Jonathan Hsieh (jon at cloudera dot com)
 * Aaron Kimball (kimballa at apache dot org)
 * Greg Cottman (greg dot cottman at quest dot com)
 * Guy le Mar (guy dot lemar at quest dot com)
 * Roman Shaposhnik (rvs at cloudera dot com)
 * Andrew Bayer (andrew at cloudera dot com)

A CLA is already on file for Aaron Kimball.

= Affiliations =

 * Arvind Prabhakar, Cloudera
 * Ahmed Radwan, Cloudera
 * Jonathan Hsieh, Cloudera
 * Aaron Kimball, Odiago
 * Greg Cottman, Quest
 * Guy le Mar, Quest
 * Roman Shaposhnik, Cloudera
 * Andrew Bayer, Cloudera

= Sponsors =

== Champion ==

 * Tom White (tomwhite at apache dot org)

== Nominated Mentors ==

 * Patrick Hunt (phunt at apache dot org)

== Sponsoring Entity ==

 * Apache Incubator PMC
