JB, Sure. Looking forward to seeing potential mentors.
We currently have 3 mentors, but additional mentors are welcome since some
of them are busy taking vacation :-)

Thanks,
Makoto

2016-08-31 17:37 GMT+09:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
> I'm very busy mentoring my current podling bucket. I'm sure other potential
> mentors will contact you!
>
> Regards
> JB
>
>
> On 08/31/2016 09:18 AM, Makoto Yui wrote:
>>
>> Jean-Baptiste,
>>
>> Your experience as a Podling mentor is very welcome.
>>
>> Regards,
>> Makoto
>>
>> 2016-08-31 15:24 GMT+09:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>>
>>> Hi Makoto,
>>>
>>> it would have been with a lot of pleasure, but I'm already a mentor in
>>> several podlings.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 08/31/2016 06:30 AM, Makoto Yui wrote:
>>>>
>>>>
>>>> As Roman mentioned, we welcome volunteering mentors.
>>>>
>>>> Please find our proposal in
>>>> https://wiki.apache.org/incubator/HivemallProposal
>>>>
>>>> Thanks,
>>>> Makoto
>>>>
>>>> 2016-08-31 11:28 GMT+09:00 Roman Shaposhnik <r...@apache.org>:
>>>>>
>>>>>
>>>>> Hi!
>>>>>
>>>>> It seems that the discussion has converged and I'd like to
>>>>> make one extra call for volunteering mentors. Please let
>>>>> me know ASAP since I'd like to get the VOTE going tomorrow.
>>>>>
>>>>> Thanks,
>>>>> Roman.
>>>>>
>>>>> On Mon, Aug 22, 2016 at 10:20 AM, Roman Shaposhnik <r...@apache.org>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> on behalf of the Hivemall team, I'd like to kick off
>>>>>> a discussion thread around accepting Hivemall
>>>>>> into the ASF Incubator.
>>>>>>
>>>>>> Hivemall is a library for machine learning implemented
>>>>>> as Hive UDFs/UDAFs/UDTFs that runs on Hadoop-based
>>>>>> data processing frameworks. More specifically, it currently
>>>>>> runs on Apache Hive, Apache Spark, and Apache Pig, which
>>>>>> support Hive UDFs as an extension mechanism.
>>>>>>
>>>>>> Here's the link to the proposal:
>>>>>> https://wiki.apache.org/incubator/HivemallProposal
>>>>>> and the full text is also attached to this email.
>>>>>>
>>>>>> Two of the areas that I'd like to explicitly solicit IPMC's opinion
>>>>>> on are:
>>>>>> 1. whether the process of re-licensing from LGPL to ALv2
>>>>>> was enough given the ASF's strict IP policies
>>>>>>
>>>>>> 2. whether the 5 initial committers make sense given that
>>>>>> there's a total of 15 contributors as per GitHub stats.
>>>>>>
>>>>>> With that, thanks, in advance, for your time and let the discussion
>>>>>> begin!
>>>>>>
>>>>>> Thanks,
>>>>>> Roman.
>>>>>>
>>>>>> == Abstract ==
>>>>>>
>>>>>> Hivemall is a library for machine learning implemented as Hive
>>>>>> UDFs/UDAFs/UDTFs.
>>>>>>
>>>>>> Hivemall runs on Hadoop-based data processing frameworks, specifically
>>>>>> on Apache Hive, Apache Spark, and Apache Pig, which support Hive UDFs
>>>>>> as an extension mechanism.
>>>>>>
>>>>>> == Proposal ==
>>>>>>
>>>>>> Hivemall is a collection of machine learning algorithms and versatile
>>>>>> data analytics functions. It provides a number of easy-to-use machine
>>>>>> learning functionalities through user-defined functions (UDFs),
>>>>>> user-defined aggregate functions (UDAFs), and/or user-defined table
>>>>>> generating functions (UDTFs) of Apache Hive. It offers a variety of
>>>>>> functionalities: regression, classification, recommendation, anomaly
>>>>>> detection, k-nearest neighbor, and feature engineering. Hivemall
>>>>>> supports state-of-the-art machine learning algorithms such as Soft
>>>>>> Confidence Weighted, Adaptive Regularization of Weight Vectors,
>>>>>> Factorization Machines, and AdaDelta. Hivemall is mainly designed to
>>>>>> run on Apache Hive but it also supports Apache Pig and Apache Spark
>>>>>> for the runtime.
>>>>>>
>>>>>> == Background ==
>>>>>>
>>>>>> Hivemall started as a research project of the main developer at
>>>>>> National Institute of Advanced Industrial Science and Technology
>>>>>> (AIST) in 2013 and the initial version was released on 2 Oct, 2013 on
>>>>>> GitHub: https://github.com/myui/hivemall.
>>>>>>
>>>>>> After the main developer moved to Treasure Data in 2015, the project
>>>>>> has been actively developed as an open source product and changed the
>>>>>> license from GNU LGPL v2.1 to Apache License v2 on Mar 16, 2015. The
>>>>>> project copyright holders agreed to change the license then.
>>>>>>
>>>>>> The community is growing incrementally and the project has 15
>>>>>> contributors, 431 stars, and 131 forks on GitHub as of Aug 15, 2016.
>>>>>> The project was awarded an InfoWorld Bossie Award (the best open
>>>>>> source big data tools) in 2014.
>>>>>>
>>>>>> Past main contributions by external contributors include Apache Pig
>>>>>> support from Daniel Dai (Hortonworks), and the Apache Spark port and an
>>>>>> integration with Apache YARN from Takeshi Yamamuro (NTT). Hivemall was
>>>>>> originally designed for Apache Hive but it now supports Apache Spark
>>>>>> and Apache Pig.
>>>>>>
>>>>>> == Rationale ==
>>>>>>
>>>>>> User-defined functions are a powerful mechanism to enrich the expressive
>>>>>> power of declarative query languages like SQL, HiveQL, PigLatin, and
>>>>>> Spark SQL. The Hive UDF interface is now becoming the de facto standard
>>>>>> for SQL-on-Hadoop platforms; Apache Spark and Apache Pig have full
>>>>>> support for Hive UDFs/UDAFs/UDTFs, and Apache Impala, Apache Drill,
>>>>>> and Apache Tajo also have limited support for Hive UDFs/UDAFs.
>>>>>>
>>>>>> Hivemall can be considered a cross-platform library for machine
>>>>>> learning, as Hivemall is implemented as cross-platform Hive
>>>>>> UDFs/UDAFs/UDTFs; prediction models built by a batch query of Apache
>>>>>> Hive can be used on Apache Spark/Pig, and conversely, prediction
>>>>>> models built by Apache Spark can be used from Apache Hive/Pig.
>>>>>>
>>>>>> Several database vendors are trying to offer machine learning
>>>>>> functionality in relational databases, so that the costs of moving
>>>>>> data can be eliminated. Apache MADlib, a machine learning library for
>>>>>> HAWQ and PostgreSQL, has been accepted as an Apache Incubator project.
>>>>>> MADlib is implemented using the PostgreSQL UDF interface.
>>>>>>
>>>>>> Apache Hive has a JIRA ticket, HIVE-7940, to support machine learning
>>>>>> functionalities. So, we consider this proposal useful for the
>>>>>> community. We consider that Hivemall is better as a project separate
>>>>>> from Apache Hive because 1) we target other data processing
>>>>>> frameworks such as Apache Spark as well for the runtime of Hivemall,
>>>>>> and 2) the current codebase is large enough to be separated.
>>>>>> Separation of concerns is good for project governance (e.g., release
>>>>>> management). For example, Apache Datafu is a data mining and statistics
>>>>>> library for Apache Pig and a project separate from Apache Pig.
>>>>>>
>>>>>> We consider that Hivemall would be in a similar position to Apache
>>>>>> Datafu, but there are large differences in features and target runtimes.
>>>>>> The target runtime of Apache Datafu is Apache Pig, but Hivemall targets
>>>>>> Apache Hive, Apache Spark, and Apache Pig.
>>>>>> Apache Datafu is more of a statistics library and does not
>>>>>> support machine learning features such as classification and
>>>>>> regression, but Hivemall is a machine learning library supporting them.
>>>>>>
>>>>>> == Initial Goals ==
>>>>>>
>>>>>> The initial goals are as follows:
>>>>>> * Establish the project governance in the Apache way and broaden the
>>>>>> community
>>>>>> * Improve documentation
>>>>>> * Add more unit/scenario tests
>>>>>> * Handover of code and copyrights
>>>>>>
>>>>>> == Current Status ==
>>>>>>
>>>>>> Hivemall has several on-going WIP features.
>>>>>>
>>>>>> Making a parameter server (a kind of distributed key-value store) an
>>>>>> Apache YARN application is a major issue. Hivemall’s parameter server
>>>>>> is currently a standalone application. Running parameter servers on
>>>>>> Apache YARN enables efficient use of Hadoop cluster resources and makes
>>>>>> management of parameter servers easier.
>>>>>>
>>>>>> Another major WIP issue is integrating XGBoost into Hivemall. We need
>>>>>> more work and tests, e.g., supporting cross compilation of native JNI
>>>>>> objects of XGBoost.
>>>>>>
>>>>>> === Meritocracy ===
>>>>>>
>>>>>> The project members understand the importance of letting motivated
>>>>>> individuals contribute to the project. Since Hivemall was initially
>>>>>> released in 2014, it has received contributions from 14 contributors.
>>>>>>
>>>>>> Our intent with this incubator proposal is to build a diverse developer
>>>>>> community following the Apache meritocracy model. We welcome external
>>>>>> contributions and plan to elect committers from those who contribute
>>>>>> significantly to the project.
>>>>>>
>>>>>> === Community ===
>>>>>>
>>>>>> While there are 15 contributors in total, there are 3-4 active
>>>>>> developers continuously involved in the major feature development at
>>>>>> the moment. We hope to extend our contributor base and encourage
>>>>>> suggestions and contributions from any potential user.
>>>>>>
>>>>>> === Core Developers ===
>>>>>>
>>>>>> The current main developers are employees of Treasure Data, NTT
>>>>>> and Hortonworks. Some of them are Hadoop/Pig PMC members and/or Hive
>>>>>> committers.
>>>>>>
>>>>>> === Alignment ===
>>>>>>
>>>>>> Incubating at the ASF is the natural choice for the Hivemall project
>>>>>> because Hivemall targets Apache Hive, Apache Spark,
>>>>>> and Apache Pig. We encourage integrations with other ASF data
>>>>>> processing frameworks like Apache Impala and Apache Drill.
>>>>>>
>>>>>> == Known Risks ==
>>>>>>
>>>>>> The contributions of the main developer are significant at the moment
>>>>>> but the dependency would decrease as the community grows.
>>>>>>
>>>>>> === Orphaned products ===
>>>>>>
>>>>>> While the main developer is developing Hivemall as a full-time job at
>>>>>> Treasure Data, the company is well aware of the open source
>>>>>> philosophy and the importance of open governance of open source
>>>>>> products. Orphaning an ASF product can itself be considered a risk.
>>>>>> Hence, we think the risks of it being orphaned are minimal.
>>>>>>
>>>>>> === Inexperience with Open Source ===
>>>>>>
>>>>>> Hivemall has been developed as an open source project since 2013.
>>>>>> The majority of the project members have jobs developing open source
>>>>>> products and some of them are working on other ASF projects like
>>>>>> Apache Hadoop and Apache Pig. We thus consider that the project
>>>>>> members have enough experience with open source development.
>>>>>>
>>>>>> === Homogenous Developers ===
>>>>>>
>>>>>> The current list of committers consists of developers from three
>>>>>> different companies. The committers are geographically distributed
>>>>>> across the U.S. and Asia. They are experienced with working in a
>>>>>> distributed environment.
>>>>>>
>>>>>> While not included in the initial committers, there are other external
>>>>>> contributors to the project. So, we hope to establish a developer
>>>>>> community that includes those contributors from several other
>>>>>> corporations during the incubation process.
>>>>>>
>>>>>> === Reliance on Salaried Developers ===
>>>>>>
>>>>>> The major developer is paid by his employer to contribute to this
>>>>>> project and the other developers are paid by their employers for
>>>>>> Hadoop-related open source development. While they might change their
>>>>>> affiliations over time, they are willing to offer their expertise for
>>>>>> the open source development. So, the project would continue regardless
>>>>>> of their affiliations.
>>>>>>
>>>>>> === Relationships with Other Apache Products ===
>>>>>>
>>>>>> Hivemall is a collection of machine learning functions on Apache
>>>>>> Hive, Apache Spark, and Apache Pig. Apache MADlib is a collection of
>>>>>> machine learning functions for relational databases, i.e., Apache HAWQ
>>>>>> and PostgreSQL. There is no conflict in their target runtimes.
>>>>>>
>>>>>> === An Excessive Fascination with the Apache Brand ===
>>>>>>
>>>>>> Our interest in this incubation is attracting more contributors,
>>>>>> building a strong community with open governance, and increasing the
>>>>>> visibility of Hivemall in the market/community. We will be sensitive
>>>>>> to inadvertent abuse of the Apache brand for any commercial use and
>>>>>> will work with the Incubator PMC and project mentors to ensure the
>>>>>> brand policies are respected.
>>>>>>
>>>>>> == Documentation ==
>>>>>>
>>>>>> Information on Hivemall can be found at:
>>>>>> https://github.com/myui/hivemall/wiki
>>>>>>
>>>>>> == Initial Source ==
>>>>>>
>>>>>> We released the initial version of Hivemall in 2013 at
>>>>>> https://github.com/myui/hivemall and introduced Hivemall at the Hadoop
>>>>>> Summit 2014.
>>>>>>
>>>>>> == Source and Intellectual Property Submission Plan ==
>>>>>>
>>>>>> We know of no legal encumbrance to transferring the source to Apache.
>>>>>> We are going to obtain Contributor License Agreements (CLAs) for all
>>>>>> property of Hivemall.
>>>>>>
>>>>>> Also, we plan to get a Software Grant Agreement (SGA) signed by
>>>>>> AIST.
>>>>>>
>>>>>> == External Dependencies ==
>>>>>>
>>>>>> Hivemall depends on the following third party libraries:
>>>>>>
>>>>>> Core module:
>>>>>> * netty (The MIT License)
>>>>>> * smile (Apache License v2.0)
>>>>>> * org.tukaani.xz (Public Domain)
>>>>>> * xgboost (Apache License v2.0)
>>>>>> * hadoop (Apache License v2.0)
>>>>>> * hive (Apache License v2.0)
>>>>>> * log4j (Apache License v2.0)
>>>>>> * guava (Apache License v2.0)
>>>>>> * lucene-analyzers-kuromoji (Apache License v2.0)
>>>>>> * junit (Eclipse Public License v1.0)
>>>>>> * mockito (The MIT License)
>>>>>> * powermock (Apache License v2.0)
>>>>>> * kryo (BSD License)
>>>>>>
>>>>>> Hivemall on Spark:
>>>>>> * spark (Apache License v2.0)
>>>>>> * commons-cli (Apache License v2.0)
>>>>>> * commons-logging (Apache License v2.0)
>>>>>> * commons-compress (Apache License v2.0)
>>>>>> * scala-library (BSD License)
>>>>>> * scalatest (Apache License v2.0)
>>>>>> * xerial-core (Apache License v2.0)
>>>>>>
>>>>>> The dependencies all have Apache compatible licenses.
>>>>>>
>>>>>> == Cryptography ==
>>>>>>
>>>>>> N/A
>>>>>>
>>>>>> == Required resources ==
>>>>>>
>>>>>> === Mailing lists ===
>>>>>>
>>>>>> * priv...@hivemall.incubator.apache.org (with moderated subscriptions)
>>>>>> * comm...@hivemall.incubator.apache.org
>>>>>> * d...@hivemall.incubator.apache.org
>>>>>> * u...@hivemall.incubator.apache.org
>>>>>>
>>>>>> === Git Repository ===
>>>>>>
>>>>>> https://git-wip-us.apache.org/repos/asf/incubator-hivemall.git
>>>>>>
>>>>>> === JIRA assistance ===
>>>>>>
>>>>>> JIRA project Hivemall (HIVEMALL)
>>>>>>
>>>>>> == Initial Committers ==
>>>>>>
>>>>>> * Makoto Yui (m...@treasure-data.com)
>>>>>> * Takeshi Yamamuro (yamamuro.tak...@lab.ntt.co.jp)
>>>>>> * Daniel Dai (da...@hortonworks.com)
>>>>>> * Tsuyoshi Ozawa (ozawa.tsuyo...@lab.ntt.co.jp)
>>>>>> * Kai Sasaki (sas...@treasure-data.com)
>>>>>>
>>>>>> == Affiliations ==
>>>>>>
>>>>>> === Treasure Data ===
>>>>>> * Makoto Yui
>>>>>> * Kai Sasaki
>>>>>>
>>>>>> === NTT ===
>>>>>> * Takeshi Yamamuro
>>>>>> * Tsuyoshi Ozawa Apache Hadoop PMC member
>>>>>>
>>>>>> === Hortonworks ===
>>>>>> * Daniel Dai (ASF member) Apache Pig PMC member
>>>>>>
>>>>>> == Sponsors ==
>>>>>>
>>>>>> === Champion ===
>>>>>> * Roman Shaposhnik (Pivotal, ASF member, IPMC member) Apache Bigtop/Incubator PMC member
>>>>>>
>>>>>> === Nominated Mentors ===
>>>>>>
>>>>>> * Reynold Xin (Databricks, ASF member) Apache Spark PMC member
>>>>>> * Markus Weimer (Microsoft, ASF member) Apache REEF PMC member
>>>>>> * Xiangrui Meng (Databricks, ASF member) Apache Spark PMC member
>>>>>>
>>>>>> === Sponsoring Entity ===
>>>>>>
>>>>>> We are requesting the Incubator to sponsor this project.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>>>> For additional commands, e-mail: general-h...@incubator.apache.org
>>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>

--
Makoto YUI <myui AT treasure-data.com>
Research Engineer, Treasure Data, Inc.
http://myui.github.io/