As Roman mentioned, we welcome volunteer mentors. Please find our proposal at https://wiki.apache.org/incubator/HivemallProposal
Thanks,
Makoto

2016-08-31 11:28 GMT+09:00 Roman Shaposhnik <r...@apache.org>:
> Hi!
>
> It seems that the discussion has converged and I'd like to
> make one extra call for volunteering mentors. Please let
> me know ASAP since I'd like to get the VOTE going tomorrow.
>
> Thanks,
> Roman.
>
> On Mon, Aug 22, 2016 at 10:20 AM, Roman Shaposhnik <r...@apache.org> wrote:
>> Hi!
>>
>> On behalf of the Hivemall team, I'd like to kick off
>> a discussion thread around accepting Hivemall
>> into the ASF Incubator.
>>
>> Hivemall is a library for machine learning implemented
>> as Hive UDFs/UDAFs/UDTFs that runs on Hadoop-based
>> data processing frameworks. More specifically, it currently
>> runs on Apache Hive, Apache Spark, and Apache Pig, which
>> support Hive UDFs as an extension mechanism.
>>
>> Here's the link to the proposal:
>> https://wiki.apache.org/incubator/HivemallProposal
>> and the full text is also attached to this email.
>>
>> Two of the areas that I'd like to explicitly solicit IPMC's opinion
>> on are:
>> 1. whether the process of re-licensing from LGPL to ALv2
>> was enough given the ASF's strict IP policies
>>
>> 2. whether the 5 initial committers make sense given that
>> there's a total of 15 contributors as per GitHub stats.
>>
>> With that, thanks, in advance, for your time and let the discussion begin!
>>
>> Thanks,
>> Roman.
>>
>> == Abstract ==
>>
>> Hivemall is a library for machine learning implemented as Hive
>> UDFs/UDAFs/UDTFs.
>>
>> Hivemall runs on Hadoop-based data processing frameworks, specifically
>> on Apache Hive, Apache Spark, and Apache Pig, which support Hive UDFs
>> as an extension mechanism.
>>
>> == Proposal ==
>>
>> Hivemall is a collection of machine learning algorithms and versatile
>> data analytics functions. It provides a number of easy-to-use machine
>> learning functionalities through user-defined functions (UDFs),
>> user-defined aggregate functions (UDAFs), and/or user-defined table
>> generating functions (UDTFs) of Apache Hive. It offers a variety of
>> functionalities: regression, classification, recommendation, anomaly
>> detection, k-nearest neighbor, and feature engineering. Hivemall
>> supports state-of-the-art machine learning algorithms such as Soft
>> Confidence Weighted, Adaptive Regularization of Weight Vectors,
>> Factorization Machines, and AdaDelta. Hivemall is mainly designed to
>> run on Apache Hive but it also supports Apache Pig and Apache Spark
>> as runtimes.
>>
>> == Background ==
>>
>> Hivemall started as a research project of the main developer at
>> National Institute of Advanced Industrial Science and Technology
>> (AIST) in 2013 and the initial version was released on 2 Oct, 2013 on
>> GitHub: https://github.com/myui/hivemall.
>>
>> After the main developer moved to Treasure Data in 2015, the project
>> has been actively developed as an open source product, and its
>> license was changed from GNU LGPL v2.1 to Apache License v2 on
>> Mar 16, 2015. The project copyright holders agreed to the license
>> change at that time.
>>
>> The community is growing incrementally and the project has 15
>> contributors, 431 stars, and 131 forks on GitHub as of Aug 15, 2016.
>> The project received an InfoWorld Bossie Award (the best open
>> source big data tools) in 2014.
>>
>> Past main contributions by external contributors include Apache Pig
>> support from Daniel Dai (Hortonworks), and an Apache Spark port and an
>> integration with Apache YARN from Takeshi Yamamuro (NTT). Hivemall was
>> originally designed for Apache Hive but it now supports Apache Spark
>> and Apache Pig.
>>
>> == Rationale ==
>>
>> User-defined functions are a powerful mechanism to enrich the
>> expressive power of declarative query languages like SQL, HiveQL,
>> Pig Latin, and Spark SQL. The Hive UDF interface is now becoming the
>> de facto standard for SQL-on-Hadoop platforms; Apache Spark and
>> Apache Pig have full support for Hive UDFs/UDAFs/UDTFs, and Apache
>> Impala, Apache Drill, and Apache Tajo also have limited support for
>> Hive UDFs/UDAFs.
>>
>> Hivemall can be considered a cross-platform library for machine
>> learning because it is implemented as cross-platform Hive
>> UDFs/UDAFs/UDTFs; prediction models built by a batch query of Apache
>> Hive can be used on Apache Spark/Pig, and conversely, prediction
>> models built by Apache Spark can be used from Apache Hive/Pig.
>>
>> Several database vendors are trying to offer machine learning
>> functionality in relational databases, so that the costs of moving
>> data can be eliminated. Apache MADlib, a machine learning library for
>> HAWQ and PostgreSQL, has been accepted as an Apache Incubator project.
>> MADlib is implemented using the PostgreSQL UDF interface.
>>
>> Apache Hive has a JIRA ticket, HIVE-7940, to support machine learning
>> functionalities, so we consider this proposal useful for the
>> community. We think Hivemall is better off as a project separate
>> from Apache Hive because 1) we also target other data processing
>> frameworks such as Apache Spark as runtimes for Hivemall, and
>> 2) the current codebase is large enough to stand on its own.
>> Separation of concerns is good for project governance (e.g., release
>> management). For example, Apache Datafu is a data mining and
>> statistics library for Apache Pig and a separate project from
>> Apache Pig.
>>
>> We consider that Hivemall would hold a position similar to Apache
>> Datafu, but there are large differences in features and target
>> runtimes. The target runtime of Apache Datafu is Apache Pig, whereas
>> Hivemall targets Apache Hive, Apache Spark, and Apache Pig.
>> Apache Datafu is closer to a statistics library and does not
>> support machine learning features such as classification and
>> regression, but Hivemall is a machine learning library that supports
>> them.
>>
>> == Initial Goals ==
>>
>> The initial goals are as follows:
>> * Establish project governance in the Apache way and broaden the
>> community
>> * Improve documentation
>> * Add more unit/scenario tests
>> * Hand over code and copyrights
>>
>> == Current Status ==
>>
>> Hivemall has several ongoing work-in-progress (WIP) features.
>>
>> Running the parameter server (a kind of distributed key-value store)
>> as an Apache YARN application is a major work item. Hivemall’s
>> parameter server is currently a standalone application. Running
>> parameter servers on Apache YARN would use Hadoop cluster resources
>> efficiently and make the parameter servers easier to manage.
>>
>> Another major WIP item is integrating XGBoost into Hivemall. This
>> needs more work and testing, e.g., supporting cross compilation of
>> the native JNI objects of XGBoost.
>>
>> === Meritocracy ===
>>
>> The project members understand the importance of letting motivated
>> individuals contribute to the project. Since Hivemall was initially
>> released in 2014, it has received contributions from 14 contributors.
>>
>> Our intent with this incubator proposal is to build a diverse
>> developer community following the Apache meritocracy model. We
>> welcome external contributions and plan to elect committers from
>> those who contribute significantly to the project.
>>
>> === Community ===
>>
>> While there are 15 contributors in total, there are 3-4 active
>> developers continuously involved in major feature development at
>> the moment. We hope to extend our contributor base and encourage
>> suggestions and contributions from any potential user.
>>
>> === Core Developers ===
>>
>> The current main developers are employees of Treasure Data, NTT,
>> and Hortonworks. Some of them are Hadoop/Pig PMC members and/or Hive
>> committers.
>>
>> === Alignment ===
>>
>> Incubating at ASF is the natural choice for the Hivemall project
>> because Hivemall targets Apache Hive, Apache Spark, and Apache Pig
>> as runtimes. We encourage integrations with other ASF data
>> processing frameworks like Apache Impala and Apache Drill.
>>
>> == Known Risks ==
>>
>> The main developer's contributions are significant at the moment,
>> but this dependency should decrease as the community grows.
>>
>> === Orphaned products ===
>>
>> The main developer is developing Hivemall as a full-time job at
>> Treasure Data, and the company is well aware of the open source
>> philosophy and the importance of open governance of open source
>> products. Orphaning an ASF product can itself be considered a risk.
>> Hence, we think the risk of it being orphaned is minimal.
>>
>> === Inexperience with Open Source ===
>>
>> Hivemall has been developed as an open source project since 2013.
>> The majority of the project members have jobs developing open source
>> products and some of them are working on other ASF projects like
>> Apache Hadoop and Apache Pig. We thus consider that the project
>> members have enough experience with open source development.
>>
>> === Homogenous Developers ===
>>
>> The current list of committers consists of developers from three
>> different companies. The committers are geographically distributed
>> across the U.S. and Asia. They are experienced with working in a
>> distributed environment.
>>
>> While not included in the initial committer list, there are other
>> external contributors to the project. So, we hope to establish a
>> developer community that includes those contributors from several
>> other corporations during the incubation process.
>>
>> === Reliance on Salaried Developers ===
>>
>> The major developer is paid by his employer to contribute to this
>> project and the other developers are paid by their employers for
>> Hadoop-related open source development. While they might change their
>> affiliations over time, they are willing to contribute their
>> expertise to open source development. So, the project would continue
>> regardless of their affiliations.
>>
>> === Relationships with Other Apache Products ===
>>
>> Hivemall is a collection of machine learning functions on Apache
>> Hive, Apache Spark, and Apache Pig. Apache MADlib is a collection of
>> machine learning functions for relational databases, i.e., Apache HAWQ
>> and PostgreSQL. There is no conflict in their target runtimes.
>>
>> === An Excessive Fascination with the Apache Brand ===
>>
>> Our interest in this incubation is in attracting more contributors,
>> building a strong community with open governance, and increasing the
>> visibility of Hivemall in the market/community. We will be sensitive
>> to inadvertent abuse of the Apache brand for any commercial use and
>> will work with the Incubator PMC and project mentors to ensure the
>> brand policies are respected.
>>
>> == Documentation ==
>>
>> Information on Hivemall can be found at:
>> https://github.com/myui/hivemall/wiki
>>
>> == Initial Source ==
>>
>> We released the initial version of Hivemall in 2013 at
>> https://github.com/myui/hivemall and introduced Hivemall at the Hadoop
>> Summit 2014.
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>> We know of no legal encumbrance to the transfer of the source to
>> Apache. We are going to obtain Contributor License Agreements (CLAs)
>> covering all Hivemall property.
>>
>> We also plan to obtain a signed Software Grant Agreement (SGA) from
>> AIST.
>>
>> == External Dependencies ==
>>
>> Hivemall depends on the following third party libraries:
>>
>> Core module:
>> * netty (The MIT License)
>> * smile (Apache License v2.0)
>> * org.tukaani.xz (Public Domain)
>> * xgboost (Apache License v2.0)
>> * hadoop (Apache License v2.0)
>> * hive (Apache License v2.0)
>> * log4j (Apache License v2.0)
>> * guava (Apache License v2.0)
>> * lucene-analyzers-kuromoji (Apache License v2.0)
>> * junit (Eclipse Public License v1.0)
>> * mockito (The MIT License)
>> * powermock (Apache License v2.0)
>> * kryo (BSD License)
>>
>> Hivemall on Spark:
>> * spark (Apache License v2.0)
>> * commons-cli (Apache License v2.0)
>> * commons-logging (Apache License v2.0)
>> * commons-compress (Apache License v2.0)
>> * scala-library (BSD License)
>> * scalatest (Apache License v2.0)
>> * xerial-core (Apache License v2.0)
>>
>> The dependencies all have Apache compatible licenses.
>>
>> == Cryptography ==
>>
>> N/A
>>
>> == Required resources ==
>>
>> === Mailing lists ===
>>
>> * priv...@hivemall.incubator.apache.org (with moderated subscriptions)
>> * comm...@hivemall.incubator.apache.org
>> * d...@hivemall.incubator.apache.org
>> * u...@hivemall.incubator.apache.org
>>
>> === Git Repository ===
>>
>> https://git-wip-us.apache.org/repos/asf/incubator-hivemall.git
>>
>> === JIRA assistance ===
>>
>> JIRA project Hivemall (HIVEMALL)
>>
>> == Initial Committers ==
>>
>> * Makoto Yui (m...@treasure-data.com)
>> * Takeshi Yamamuro (yamamuro.tak...@lab.ntt.co.jp)
>> * Daniel Dai (da...@hortonworks.com)
>> * Tsuyoshi Ozawa (ozawa.tsuyo...@lab.ntt.co.jp)
>> * Kai Sasaki (sas...@treasure-data.com)
>>
>> == Affiliations ==
>>
>> === Treasure Data ===
>> * Makoto Yui
>> * Kai Sasaki
>>
>> === NTT ===
>> * Takeshi Yamamuro
>> * Tsuyoshi Ozawa Apache Hadoop PMC member
>>
>> === Hortonworks ===
>> * Daniel Dai (ASF member) Apache Pig PMC member
>>
>> == Sponsors ==
>>
>> === Champion ===
>> * Roman Shaposhnik (Pivotal, ASF member, IPMC member) Apache
>> Bigtop/Incubator PMC member
>>
>> === Nominated Mentors ===
>>
>> * Reynold Xin (Databricks, ASF member) Apache Spark PMC member
>> * Markus Weimer (Microsoft, ASF member) Apache REEF PMC member
>> * Xiangrui Meng (Databricks, ASF member) Apache Spark PMC member
>>
>> === Sponsoring Entity ===
>>
>> We are requesting the Incubator to sponsor this project.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>

--
Makoto YUI <myui AT treasure-data.com>
Research Engineer, Treasure Data, Inc.
http://myui.github.io/