RE: [PROPOSAL] DataFu for Incubation

Matthew Hayes Wed, 18 Dec 2013 16:18:02 -0800

When we came up with the name a couple years ago, it was inspired by "kung fu", 
in a playful way as Roman mentioned.  Sort of like saying your Java Fu or 
Python Fu is excellent.


-Matt
________________________________________
From: sebb [seb...@gmail.com]
Sent: Wednesday, December 18, 2013 3:57 PM
To: general@incubator.apache.org
Subject: Re: [PROPOSAL] DataFu for Incubation

On 18 December 2013 22:49, Matthew Hayes <mha...@linkedin.com> wrote:
> Hi all,
>
> I would like to share our draft ASF incubation proposal for DataFu, a library 
> that makes it easier to solve data problems in Hadoop and high level 
> languages based on it.

Am I the only person to think that the last part of the name has
unfortunate connotations?
Cf. SNAFU, which ends in the same two characters.

> The proposal can be found here:
>
> https://wiki.apache.org/incubator/DataFuProposal
>
> The source code is available on GitHub:
>
> https://github.com/linkedin/datafu
>
> The text of the proposal is copied below.  Feedback is appreciated!
>
> Thanks,
> Matt
>
> == Abstract ==
>
> DataFu makes it easier to solve data problems using Hadoop and higher level
> languages based on it.
>
> == Proposal ==
>
> DataFu provides a collection of Hadoop MapReduce jobs and functions in
> higher level languages based on it to perform data analysis.  It provides
> functions for common statistics tasks (e.g. quantiles, sampling), PageRank,
> stream sessionization, and set and bag operations.  DataFu also provides
> Hadoop jobs for incremental data processing in MapReduce.
>
> == Background ==
>
> DataFu began two years ago as a set of UDFs developed internally at
> LinkedIn, coming from our desire to solve common problems with reusable
> components.  Recognizing that the community could benefit from such a
> library, we added documentation, an extensive suite of unit tests, and open
> sourced the code.  Since then there have been steady contributions to
> DataFu as we encountered common problems not yet solved by it.  Others
> outside LinkedIn have contributed as well.  More recently we recognized the
> challenges with efficient incremental processing of data in Hadoop and have
> contributed a set of Hadoop MapReduce jobs as a solution.
>
> DataFu began as a project at LinkedIn, but it has shown itself to be
> useful to other organizations and developers as well, since they have faced
> similar problems.  We would like to share DataFu with the ASF and begin
> developing a community of developers and users within Apache.
>
> == Rationale ==
>
> There is a strong need for well tested libraries that help developers solve 
> common data problems in Hadoop and higher level languages such as Pig, Hive, 
> Crunch, Scalding, etc.
>
> == Current Status ==
>
> === Meritocracy ===
>
> Our intent with this incubator proposal is to start building a diverse 
> developer community around DataFu following the Apache meritocracy model.
> Since DataFu was initially open sourced in 2011, it has received
> contributions from both within and outside LinkedIn.  We plan to continue
> support for new contributors and work with those who contribute significantly 
> to the project to make them committers.
>
> === Community ===
>
> DataFu has been building a community of developers for two years.  It began
> with contributors from LinkedIn and has received contributions from
> developers at Cloudera since very early on.  It has been included in
> Cloudera’s Hadoop Distribution and Apache Bigtop.  We hope to extend our
> contributor base significantly and invite all those who are interested in 
> solving large-scale data processing problems to participate.
>
> === Core Developers ===
>
> DataFu has a strong base of developers at LinkedIn.  Matthew Hayes
> initiated the project in 2011, and aside from continued contributions to
> DataFu has also contributed the sub-project Hourglass for incremental
> MapReduce processing.  Separate from DataFu he has also open sourced the
> White Elephant project.  Sam Shah contributed a significant portion of the
> original code and continues to contribute to the project.  William Vaughan
> has been contributing regularly to DataFu for the past two years.  Evion
> Kim has been contributing to DataFu for the past year.  Xiangrui Meng
> recently contributed implementations of scalable sampling algorithms based on
> research from a paper he published.  Chris Lloyd has provided some important
> bug fixes and unit tests.  Mitul Tiwari has also contributed to DataFu.
> Mathieu Bastian has been developing MapReduce jobs that we hope to include
> in DataFu.  In addition he also leads the open source Gephi project.
>
> === Alignment ===
>
> The ASF is the natural choice to host the DataFu project as its goal of
> encouraging community-driven open-source projects fits with our vision for
> DataFu.  Additionally, other projects DataFu integrates with, such as
> Apache Pig and Apache Hadoop, and in the future Apache Hive and Apache 
> Crunch, are hosted by the ASF and we will benefit and provide benefit by 
> close proximity to them.
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> The core developers have been contributing to DataFu for the past two
> years.  There is very little risk of DataFu being abandoned given its
> widespread use within LinkedIn.
>
> === Inexperience with Open Source ===
>
> DataFu was started as an open source project in 2011 and has remained so
> for two years.  Matt initiated the project, and additionally is the creator
> of the open source White Elephant project.  He has also contributed patches
> to Apache Pig.  Most recently he has released Hourglass as a sub-project of
> DataFu.  Sam contributed much of the original code and continues to
> contribute to the project.  Will has been contributing to DataFu since it
> was first open sourced.  Evion has been contributing for the past year.  
> Mathieu leads the open source Gephi project.  Jakob has been actively 
> involved with the ASF as a full-time Hadoop committer and PMC member.
>
> === Homogeneous Developers ===
>
> The current core developers are all from LinkedIn.  DataFu has also
> received contributions from other corporations such as Cloudera.  Two of 
> these developers are among the Initial Committers listed below.  We hope to 
> establish a developer community that includes contributors from several other 
> corporations and we are actively encouraging new contributors via 
> presentations and blog posts.
>
> === Reliance on Salaried Developers ===
>
> The current core developers are salaried employees of LinkedIn; however,
> they are not paid specifically to work on DataFu.  Contributions to
> DataFu arise from the developers solving problems they encounter in their
> various projects.  The purpose of DataFu is to share these solutions so
> that others may benefit and build a community of developers striving to solve 
> common problems together.  Furthermore, once the project has a community 
> built around it, we expect to get committers, developers and contributions 
> from outside the current core developers.
>
> === Relationships with Other Apache Products ===
>
> DataFu is deeply integrated with Apache products.  It began as a library of
> user-defined functions for Apache Pig.  It has grown to also include Hadoop 
> jobs for incremental data processing and in the future will include code for 
> other higher level languages built on top of Apache Hadoop.
>
> === An Excessive Obsession with the Apache Brand ===
>
> While we respect the reputation of the Apache brand and have no doubts that 
> it will attract contributors and users, our interest is primarily to give 
> DataFu a solid home as an open source project following an established
> development model.
>
> == Documentation ==
>
> Information on DataFu can be found at:
>
> https://github.com/LinkedIn/DataFu/blob/master/README.md
>
> == Initial Source ==
>
> The initial source is available at:
>
> https://github.com/LinkedIn/DataFu
>
> == Source and Intellectual Property Submission Plan ==
>
>  * The DataFu library source code, available on GitHub.
>
> == External Dependencies ==
>
> The initial source has the following external dependencies that are either 
> included in the final DataFu library or required in order to use it:
>
>  * fastutil (Apache 2.0)
>  * joda-time (Apache 2.0)
>  * commons-math (Apache 2.0)
>  * guava (Apache 2.0)
>  * stream (Apache 2.0)
>  * jsr-305 (BSD)
>  * log4j (Apache 2.0)
>  * json (The JSON License)
>  * avro (Apache 2.0)
>
> In addition, the following external libraries are used either in building, 
> developing, or testing the project:
>
>  * pig (Apache 2.0)
>  * hadoop (Apache 2.0)
>  * jline (BSD)
>  * antlr (BSD)
>  * commons-io (Apache 2.0)
>  * testng (Apache 2.0)
>  * maven (Apache 2.0)
>  * jsr-311 (CDDL-1.0)
>  * slf4j (MIT)
>  * eclipse (Eclipse Public License 1.0)
>  * autojar (GPLv2)
>  * jarjar (Apache 2.0)
>
> == Cryptography ==
>
> DataFu has user-defined functions that use MD5 and SHA provided by Java’s
> java.security.MessageDigest.
>
> == Required Resources ==
>
> === Mailing Lists ===
>
>  * DataFu-private for private PMC discussions (with moderated subscriptions)
>  * DataFu-dev
>  * DataFu-commits
>
> === Subversion Directory ===
>
> Git is the preferred source control system: git://git.apache.org/DataFu
>
> === Issue Tracking ===
>
> JIRA DataFu (DataFu)
>
> === Other Resources ===
>
> The existing code already has unit tests, so we would like a Hudson instance 
> to run them whenever a new patch is submitted. This can be added after 
> project creation.
>
> == Initial Committers ==
>
>  * Matthew Hayes
>  * William Vaughan
>  * Evion Kim
>  * Sam Shah
>  * Xiangrui Meng
>  * Christopher Lloyd
>  * Mathieu Bastian
>  * Mitul Tiwari
>  * Josh Wills
>  * Jarek Jarcec Cecho
>
> == Affiliations ==
>
>  * Matthew Hayes (Linked``In)
>  * William Vaughan (Linked``In)
>  * Evion Kim (Linked``In)
>  * Sam Shah (Linked``In)
>  * Xiangrui Meng (Linked``In)
>  * Christopher Lloyd (Linked``In)
>  * Mathieu Bastian (Linked``In)
>  * Mitul Tiwari (Linked``In)
>  * Josh Wills (Cloudera)
>  * Jarek Jarcec Cecho (Cloudera)
>
> == Sponsors ==
>
> === Champion ===
>
> Jakob Homan (Apache Member)
>
> === Nominated Mentors ===
>
>  * Ashutosh Chauhan <hashutosh at apache dot org>
>  * Roman Shaposhnik <rvs at apache dot org>
>  * Ted Dunning <tdunning at apache dot org>
>
> === Sponsoring Entity ===
>
> We are requesting the Incubator to sponsor this project.
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

