Re: [PROPOSAL] DataFu for Incubation

sebb Wed, 18 Dec 2013 17:42:04 -0800

OK, but the first association that came to my mind was SNAFU - perhaps
because it shares the last 3 letters with DataFu.


I just thought you ought to be aware that the name could have negative
connotations for some people.

On 19 December 2013 00:16, Matthew Hayes <mha...@linkedin.com> wrote:
> When we came up with the name a couple years ago, it was inspired by "kung 
> fu", in a playful way as Roman mentioned.  Sort of like saying your Java Fu 
> or Python Fu is excellent.
>
> -Matt
> ________________________________________
> From: sebb [seb...@gmail.com]
> Sent: Wednesday, December 18, 2013 3:57 PM
> To: general@incubator.apache.org
> Subject: Re: [PROPOSAL] DataFu for Incubation
>
> On 18 December 2013 22:49, Matthew Hayes <mha...@linkedin.com> wrote:
>> Hi all,
>>
>> I would like to share our draft ASF incubation proposal for DataFu, a 
>> library that makes it easier to solve data problems in Hadoop and high level 
>> languages based on it.
>
> I am the only person to think that the last part of the name has
> unfortunate connotations?
> c.f. SNAFU which has the same last two characters.
>
>> The proposal can be found here:
>>
>> https://wiki.apache.org/incubator/DataFuProposal
>>
>> The source code is available on GitHub:
>>
>> https://github.com/linkedin/datafu.
>>
>> The text of the proposal is copied below.  Feedback is appreciated!
>>
>> Thanks,
>> Matt
>>
>> == Abstract ==
>>
>> Data``Fu makes it easier to solve data problems using Hadoop and higher 
>> level languages based on it.
>>
>> == Proposal ==
>>
>> Data``Fu provides a collection of Hadoop Map``Reduce jobs and functions in 
>> higher level languages based on it to perform data analysis.  It provides 
>> functions for common statistics tasks (e.g. quantiles, sampling), 
>> Page``Rank, stream sessionization, and set and bag operations.  Data``Fu 
>> also provides Hadoop jobs for incremental data processing in Map``Reduce.
>>
>> == Background ==
>>
>> Data``Fu began two years ago as set of UDFs developed internally at 
>> Linked``In, coming from our desire to solve common problems with reusable 
>> components.  Recognizing that the community could benefit from such a 
>> library, we added documentation, an extensive suite of unit tests, and open 
>> sourced the code.  Since then there have been steady contributions to 
>> Data``Fu as we encountered common problems not yet solved by it.  Others 
>> outside Linked``In have contributed as well.  More recently we recognized 
>> the challenges with efficient incremental processing of data in Hadoop and 
>> have contributed a set of Hadoop Map``Reduce jobs as a solution.
>>
>> Data``Fu began as a project at Linked``In, but it has shown itself to be 
>> useful to other organizations and developers as well as they have faced 
>> similar problems.  We would like to share Data``Fu with the ASF and begin 
>> developing a community of developers and users within Apache.
>>
>> == Rationale ==
>>
>> There is a strong need for well tested libraries that help developers solve 
>> common data problems in Hadoop and higher level languages such as Pig, Hive, 
>> Crunch, Scalding, etc.
>>
>> == Current Status ==
>>
>> === Meritocracy ===
>>
>> Our intent with this incubator proposal is to start building a diverse 
>> developer community around Data``Fu following the Apache meritocracy model.  
>> Since Data``Fu was initially open sourced in 2011, it has received 
>> contributions from both within and outside Linked``In.  We plan to continue 
>> support for new contributors and work with those who contribute 
>> significantly to the project to make them committers.
>>
>> === Community ===
>>
>> Data``Fu has been building a community of developers for two years.  It 
>> began with contributors from Linked``In and has received contributions from 
>> developers at Cloudera since very early on.  It has been included included 
>> in Cloudera’s Hadoop Distribution and Apache Bigtop.  We hope to extend our 
>> contributor base significantly and invite all those who are interested in 
>> solving large-scale data processing problems to participate.
>>
>> === Core Developers ===
>>
>> Data``Fu has a strong base of developers at Linked``In.  Matthew Hayes 
>> initiated the project in 2011, and aside from continued contributions to 
>> Data``Fu has also contributed the sub-project Hourglass for incremental 
>> Map``Reduce processing.  Separate from Data``Fu he has also open sourced the 
>> White Elephant project.  Sam Shah contributed a significant portion of the 
>> original code and continues to contribute to the project.  William Vaughan 
>> has been contributing regularly to Data``Fu for the past two years.  Evion 
>> Kim has been contributing to Data``Fu for the past year.  Xiangrui Meng 
>> recently contributed implementations of scalable sampling algorithms based 
>> on research from a paper he published.  Chris Lloyd has provided some 
>> important bug fixes and unit tests.  Mitul Tiwari has also contributed to 
>> Data``Fu.  Mathieu Bastian has been developing Map``Reduce jobs that we hope 
>> to include in Data``Fu.  In addition he also leads the open source Gephi 
>> project.
>>
>> === Alignment ===
>>
>> The ASF is the natural choice to host the Data``Fu project as its goal of 
>> encouraging community-driven open-source projects fits with our vision for 
>> Data``Fu.  Additionally, other projects Data``Fu integrates with, such as 
>> Apache Pig and Apache Hadoop, and in the future Apache Hive and Apache 
>> Crunch, are hosted by the ASF and we will benefit and provide benefit by 
>> close proximity to them.
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>>
>> The core developers have been contributing to Data``Fu for the past two 
>> years.  There is very little risk of Data``Fu being abandoned given its 
>> widespread use within Linked``In.
>>
>> === Inexperience with Open Source ===
>>
>> Data``Fu was started as an open source project in 2011 and has remained so 
>> for two years.  Matt initiated the project, and additionally is the creator 
>> of the open source White Elephant project.  He has also contributed patches 
>> to Apache Pig.  Most recently he has released Hourglass as a sub-project of 
>> Data``Fu.  Sam contributed much of the original code and continues to 
>> contribute to the project.  Will has been contributing to Data``Fu since it 
>> was first open sourced.  Evion has been contributing for the past year.  
>> Mathieu leads the open source Gephi project.  Jakob has been actively 
>> involved with the ASF as a full-time Hadoop committer and PMC member.
>>
>> === Homogeneous Developers ===
>>
>> The current core developers are all from Linked``In.  Data``Fu has also 
>> received contributions from other corporations such as Cloudera.  Two of 
>> these developers are among the Initial Committers listed below.  We hope to 
>> establish a developer community that includes contributors from several 
>> other corporations and we are actively encouraging new contributors via 
>> presentations and blog posts.
>>
>> === Reliance on Salaried Developers ===
>>
>> The current core developers are salaried employees of Linked``In, however 
>> they are not paid specifically to work on Data``Fu.  Contributions to 
>> Data``Fu arise from the developers solving problems they encounter in their 
>> various projects.  The purpose of Data``Fu is to share these solutions so 
>> that others may benefit and build a community of developers striving to 
>> solve common problems together.  Furthermore, once the project has a 
>> community built around it, we expect to get committers, developers and 
>> contributions from outside the current core developers.
>>
>> === Relationships with Other Apache Products ===
>>
>> Data``Fu is deeply integrated with Apache products.  It began as a library 
>> of user-defined functions for Apache Pig.  It has grown to also include 
>> Hadoop jobs for incremental data processing and in the future will include 
>> code for other higher level languages built on top of Apache Hadoop.
>>
>> === An Excessive Obsession with the Apache Brand ===
>>
>> While we respect the reputation of the Apache brand and have no doubts that 
>> it will attract contributors and users, our interest is primarily to give 
>> Data``Fu a solid home as an open source project following an established 
>> development model.
>>
>> == Documentation ==
>>
>> Information on Data``Fu can be found at:
>>
>> https://github.com/LinkedIn/DataFu/blob/master/README.md
>>
>> == Initial Source ==
>>
>> The initial source is available at:
>>
>> https://github.com/LinkedIn/DataFu
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>>  * The Data``Fu library source code, available on Git``Hub.
>>
>> == External Dependencies ==
>>
>> The initial source has the following external dependencies that are either 
>> included in the final Data``Fu library or required in order to use it:
>>
>>  * fastutil (Apache 2.0)
>>  * joda-time (Apache 2.0)
>>  * commons-math (Apache 2.0)
>>  * guava (Apache 2.0)
>>  * stream (Apache 2.0)
>>  * jsr-305 (BSD)
>>  * log4j (Apache 2.0)
>>  * json (The JSON License)
>>  * avro (Apache 2.0)
>>
>> In addition, the following external libraries are used either in building, 
>> developing, or testing the project:
>>
>>  * pig (Apache 2.0)
>>  * hadoop (Apache 2.0)
>>  * jline (BSD)
>>  * antlr (BSD)
>>  * commons-io (Apache 2.0)
>>  * testng (Apache 2.0)
>>  * maven (Apache 2.0)
>>  * jsr-311 (CDDL-1.0)
>>  * slf4j (MIT)
>>  * eclipse (Eclipse Public License 1.0)
>>  * autojar (GPLv2)
>>  * jarjar (Apache 2.0)
>>
>> == Cryptography ==
>>
>> Data``Fu has user-defined functions that use MD5 and SHA provided by Java’s 
>> java.security.Message``Digest.
>>
>> == Required Resources ==
>>
>> === Mailing Lists ===
>>
>> Data``Fu-private for private PMC discussions (with moderated subscriptions) 
>> Data``Fu-dev Data``Fu-commits
>>
>> === Subversion Directory ===
>>
>> Git is the preferred source control system: git://git.apache.org/DataFu
>>
>> === Issue Tracking ===
>>
>> JIRA Data``Fu (Data``Fu)
>>
>> === Other Resources ===
>>
>> The existing code already has unit tests, so we would like a Hudson instance 
>> to run them whenever a new patch is submitted. This can be added after 
>> project creation.
>>
>> == Initial Committers ==
>>
>>  * Matthew Hayes
>>  * William Vaughan
>>  * Evion Kim
>>  * Sam Shah
>>  * Xiangrui Meng
>>  * Christopher Lloyd
>>  * Mathieu Bastian
>>  * Mitul Tiwari
>>  * Josh Wills
>>  * Jarek Jarcec Cecho
>>
>> == Affiliations ==
>>
>>  * Matthew Hayes (Linked``In)
>>  * William Vaughan (Linked``In)
>>  * Evion Kim (Linked``In)
>>  * Sam Shah (Linked``In)
>>  * Xiangrui Meng (Linked``In)
>>  * Christopher Lloyd (Linked``In)
>>  * Mathieu Bastian (Linked``In)
>>  * Mitul Tiwari (Linked``In)
>>  * Josh Wills (Cloudera)
>>  * Jarek Jarcec Cecho (Cloudera)
>>
>> == Sponsors ==
>>
>> === Champion ===
>>
>> Jakob Homan (Apache Member)
>>
>> === Nominated Mentors ===
>>
>>  * Ashutosh Chauhan <hashutosh at apache dot org>
>>  * Roman Shaposhnik <rvs at apache dot org>
>>  * Ted Dunning <tdunning at apache dot org>
>>
>> === Sponsoring Entity ===
>>
>> We are requesting the Incubator to sponsor this project.
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [PROPOSAL] DataFu for Incubation

Reply via email to