Thanks for pointing that out, Henry! Yes, it looks like H2O is not an Apache project; I should have verified that. I will correct it and revisit that section together with the folks in the Singa community.
On Tue, Jan 27, 2015 at 6:55 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> Quick immediate comment that "Apache H2O" is not really an Apache project.
>
> I assume you are referring to https://github.com/h2oai/h2o (or
> https://github.com/h2oai/h2o-dev) ?
>
> - Henry
>
> On Tue, Jan 27, 2015 at 5:29 PM, Thejas Nair <thejas.n...@gmail.com> wrote:
>> Hello everyone,
>>
>> I would like to propose the inclusion of Singa as an Apache Incubator
>> project.
>>
>> Here is the proposal - https://wiki.apache.org/incubator/SingaProposal
>>
>> Please review the proposal and give feedback. I am planning to start a
>> vote after 7 days if the proposal looks good.
>> We are also seeking additional Apache mentors for the project.
>>
>> Thanks,
>> Thejas
>> ==========================================================
>> Singa Incubator Proposal
>>
>> Abstract
>>
>> SINGA is a distributed deep learning platform.
>>
>> Proposal
>>
>> SINGA is an efficient, scalable and easy-to-use distributed platform
>> for training deep learning models, e.g., Deep Convolutional Neural
>> Network and Deep Belief Network. It parallelizes the computation
>> (i.e., training) onto a cluster of nodes by distributing the training
>> data and model automatically to speed up the training. Built-in
>> training algorithms like Back-Propagation and Contrastive Divergence
>> are implemented based on common abstractions of deep learning models.
>> Users can train their own deep learning models by simply customizing
>> these abstractions, much like implementing the Mapper and Reducer in
>> Hadoop.
>>
>> Background
>>
>> Deep learning refers to a set of feature (or representation) learning
>> models that consist of multiple (non-linear) layers, where different
>> layers learn different levels of abstraction (representations) of the
>> raw input data.
>> Larger (in terms of model parameters) and deeper (in
>> terms of number of layers) models have shown better performance, e.g.,
>> lower image classification error in the Large Scale Visual Recognition
>> Challenge. However, a larger model requires more memory and larger
>> training data to reduce over-fitting. Complex numeric operations make
>> the training computation-intensive. In practice, training large deep
>> learning models takes weeks or months on a single node (even with
>> a GPU).
>>
>> Rationale
>>
>> Deep learning has gained a lot of attention in both academia and
>> industry due to its success in a wide range of areas such as computer
>> vision and speech recognition. However, training of such models is
>> computationally expensive, especially for large and deep models (e.g.,
>> with billions of parameters and more than 10 layers). Both Google and
>> Microsoft have developed distributed deep learning systems to make the
>> training more efficient by distributing the computations within a
>> cluster of nodes. However, these systems are closed-source software.
>> Our goal is to leverage the community of open source developers to
>> make SINGA efficient, scalable and easy to use. SINGA is a
>> full-fledged distributed platform that could benefit the community and
>> also benefit from the community's involvement in contributing
>> to further work in this area. We believe the nature of SINGA and
>> our vision for the system fit naturally with Apache's philosophy and
>> development framework.
>>
>> Initial Goals
>>
>> We have developed a system for SINGA running on a commodity computer
>> cluster. The initial goals include:
>> * improving the system in terms of scalability and efficiency, e.g.,
>>   using Infiniband for network communication and multi-threading for
>>   single-node computation. We would consider extending SINGA to GPU
>>   clusters later.
>> * benchmarking with larger datasets (hundreds of millions of training
>>   instances) and models (billions of parameters).
>> * adding more built-in deep learning models. Users can train the
>>   built-in models on their datasets directly.
>>
>> Current Status
>>
>> Meritocracy
>>
>> We would like to follow ASF meritocratic principles to encourage more
>> developers to contribute to this project. We know that only active and
>> excellent developers can make SINGA a successful project. The
>> committer list and PMC will be updated based on developers'
>> performance and commitment. We are also improving the documentation
>> and code to help new developers get started quickly.
>>
>> Community
>>
>> SINGA is currently being developed in the Database System Research Lab
>> at the National University of Singapore (NUS) in collaboration with
>> Zhejiang University in China. Our lab has extensive experience in
>> building database related systems, including distributed systems. Six
>> PhD students and research assistants (Jinyang Gao, Kaiping Zheng,
>> Sheng Wang, Wei Wang, Zhaojing Luo and Zhongle Xie), a research
>> fellow (Anh Dinh) and three professors (Beng Chin Ooi, Gang Chen, Kian
>> Lee Tan) have been working for a year on this project. We are open to
>> recruiting more developers from diverse backgrounds.
>>
>> Core Developers
>>
>> Beng Chin Ooi, Gang Chen and Kian Lee Tan are professors who have
>> worked on distributed systems for more than 20 years. They have
>> collaborated with the industry and have built various large scale
>> systems. Anh Dinh's research is also on distributed systems, albeit
>> with more focus on security aspects. Wei Wang's research is on deep
>> learning problems including deep learning applications and large scale
>> training. Sheng Wang and Jinyang are working on efficient indexing,
>> querying of large scale data and machine learning. Kaiping, Zhaojing
>> and Zhongle are new PhD students who joined SINGA recently.
>> They will
>> work on this project for a longer time (next 4-5 years). While we
>> share common research interests, each member also brings diverse
>> expertise to the team.
>>
>> Alignment
>>
>> ASF is already the home of many distributed platforms, e.g., Hadoop,
>> Spark and Mahout, each of which targets a different application
>> domain. SINGA, being a distributed platform for large-scale deep
>> learning, focuses on another important domain that still
>> lacks a robust and scalable open-source platform. The recent success
>> of deep learning models, especially for vision and speech recognition
>> tasks, has generated interest both in applying existing deep learning
>> models and in developing new ones. Thus, an open-source platform for
>> deep learning will be able to attract a large community of users and
>> developers. SINGA is a complex system needing many iterations of
>> design, implementation and testing. Apache's collaboration framework,
>> which encourages active contribution from developers, will inevitably
>> help improve the quality of the system, as shown in the success of
>> Hadoop, Spark, etc. Equally important is the community of users, which
>> helps identify real-life applications of deep learning, and helps to
>> evaluate the system's performance and ease-of-use. We hope to leverage
>> ASF for coordinating and promoting both communities, and in return
>> benefit the communities with another useful tool.
>>
>> Known Risks
>>
>> Orphaned products
>>
>> Four core developers (Anh, Wei Wang, Jinyang and Sheng Wang) may leave
>> the lab in two to four years' time. It is possible that some of them
>> may not have enough time to focus on this project after that. But
>> SINGA is part of our other bigger research projects on building an
>> infrastructure for data intensive applications, which include
>> health-care analytics and brain-inspired computing. Beng Chin and Kian
>> Lee would continue working on it and getting more people involved.
>> For
>> example, three new developers (Kaiping, Zhaojing and Zhongle) joined
>> us recently. Individual developers are welcome to make SINGA a diverse
>> community that is robust and independent from any single developer.
>>
>> Inexperience with Open Source
>>
>> All the developers are active users and followers of open source
>> projects. Our research lab has a strong commitment to open source, and
>> has released the source code of several systems under open source
>> licenses as a way of contributing back to the open source community.
>> But we do not have much real experience in open source projects with
>> large and well organized communities like those in Apache. This is one
>> reason we chose Apache, which is experienced in open source project
>> incubation. We hope to get help from Apache (e.g., champion and
>> mentors) to establish a healthy path for SINGA.
>>
>> Homogeneous Developers
>>
>> Although the current developers are researchers in the universities,
>> they have different research interests and project experiences, as
>> mentioned in the section that introduces the core developers. We know
>> that a diverse community is helpful. Hence we are open to the idea of
>> recruiting developers from other regions and organizations.
>>
>> Reliance on Salaried Developers
>>
>> As a research project in the university, SINGA's current developer
>> community consists of professors, PhD students, research assistants
>> and postdoctoral fellows. They are driven by their interests to work
>> on this project and have contributed actively since the start of the
>> project. The research assistants and fellows are expected to leave
>> when their contracts expire. However, they are keen to continue to
>> work on the project voluntarily. Moreover, as a long term research
>> project, new research assistants and fellows are likely to join the
>> project.
>>
>> An Excessive Fascination with the Apache Brand
>>
>> We choose Apache not for publicity. We have two purposes.
>> First, we
>> want to leverage Apache's reputation to recruit more developers to
>> make a diverse community. Second, we hope that Apache can help us to
>> establish a healthy path in developing SINGA. Beng Chin and Kian-Lee
>> are established database and distributed system researchers, and
>> together with the other contributors, they sincerely believe that
>> there is a need for a widely accepted open source distributed deep
>> learning platform. The field of deep learning is still in its infancy,
>> and an open source platform will fuel the research in the area.
>> Moreover, such a platform will enable researchers to develop new
>> models and algorithms, rather than spending time implementing a deep
>> learning system from scratch. Furthermore, the need for scalability
>> for such a platform is obvious.
>>
>> Relationship with Other Apache Products
>>
>> Apache H2O implemented two simple deep learning models, namely the
>> Multi-Layer Perceptron and Deep Auto-encoders. There are two
>> significant differences between H2O and SINGA. First, H2O adopts the
>> Map-Reduce framework, which runs a set of computing nodes in parallel
>> against partitions of the training set. Model parameters trained by all
>> computing nodes are averaged as the final model parameters. This
>> training algorithm is different from the distributed training
>> algorithm used by DistBelief, Adam and SINGA, which frequently
>> synchronizes the parameters trained from different nodes. SINGA adopts
>> the parameter server framework to support a wide range of distributed
>> training algorithms and parallelization methods (e.g., data
>> parallelism, model parallelism and hybrid parallelism), whereas H2O
>> only supports data parallelism. Second, in H2O, users are restricted to
>> using the two built-in models. In SINGA, we provide a simple programming
>> model to let users implement their own deep learning models.
>> A new
>> deep learning model can be implemented by customizing the base Layer
>> class for each layer involved in the model. It is similar to writing
>> Hadoop programs where users only need to override the base Mapper and
>> Reducer. We also provide built-in models for users to use directly.
>>
>> Documentation
>>
>> The project is hosted at
>> http://www.comp.nus.edu.sg/~dbsystem/project/singa.html.
>> Documentation can be found on the GitHub wiki page:
>> https://github.com/nusinga/singa/wiki. We continue to refine and
>> improve the documentation.
>>
>> Initial Source
>>
>> We use GitHub to maintain our source code, https://github.com/nusinga/singa
>>
>> Source and Intellectual Property Submission Plan
>>
>> We plan to release our code base under the Apache License, Version 2.0.
>>
>> External Dependencies
>>
>> required by the core code base: glog, gflags, google protobuf,
>> open-blas, mpich, armci-mpi.
>> required by data preparation and preprocessing: opencv, hdfs, python.
>>
>> Cryptography
>>
>> Not Applicable
>>
>> Required Resources
>>
>> Mailing Lists
>>
>> Currently, we use a Google Group for internal discussion. The mailing
>> address is nusi...@googlegroup.com. We will migrate the content to the
>> Apache mailing lists in the future.
>>
>> singa-dev
>> singa-user
>> singa-commits
>> singa-private (for private discussion within the PMC)
>>
>> Git Repository
>>
>> We want to continue using git for version control. Hence, a git repo
>> is required.
>>
>> Issue Tracking
>>
>> JIRA Singa (SINGA)
>>
>> Initial Committers
>>
>> Beng Chin Ooi (ooibc @comp.nus.edu.sg)
>> Kian Lee Tan (tankl @comp.nus.edu.sg)
>> Gang Chen (cg @zju.edu.cn)
>> Wei Wang (wangwei @comp.nus.edu.sg)
>> Dinh Tien Tuan Anh (dinhtta @comp.nus.edu.sg)
>> Jinyang Gao (jinyang.gao @comp.nus.edu.sg)
>> Sheng Wang (wangsh @comp.nus.edu.sg)
>> Kaiping Zheng (kaiping @comp.nus.edu.sg)
>> Zhaojing Luo (zhaojing @comp.nus.edu.sg)
>> Zhongle Xie (zhongle @comp.nus.edu.sg)
>>
>> Affiliations
>>
>> Beng Chin Ooi, National University of Singapore
>> Kian Lee Tan, National University of Singapore
>> Gang Chen, Zhejiang University
>> Wei Wang, National University of Singapore
>> Dinh Tien Tuan Anh, National University of Singapore
>> Jinyang Gao, National University of Singapore
>> Sheng Wang, National University of Singapore
>> Kaiping Zheng, National University of Singapore
>> Zhaojing Luo, National University of Singapore
>> Zhongle Xie, National University of Singapore
>>
>> Sponsors
>>
>> Champion
>>
>> Thejas Nair (thejas at apache.org) - Hortonworks
>>
>> Nominated Mentors
>>
>> Thejas Nair (thejas at apache.org) - Hortonworks
>> Alan Gates (gates at apache dot org) - Hortonworks
>> (Seeking more volunteers!)
>>
>> Sponsoring Entity
>>
>> We are requesting the Incubator to sponsor this project.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org