+1 (binding)

On Thu, Feb 28, 2013 at 11:52 AM, Matthias Friedrich <m...@mafr.de> wrote:
> +1 (non-binding)
>
> Looks really interesting, good luck!
>
> Regards,
> Matthias
>
> On Friday, 2013-03-01, Hyunsik Choi wrote:
> > Hi Folks,
> >
> > I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
> > The vote will close on Mar 7 at 6:00 PM (PST).
> >
> > [] +1 Accept Tajo into the Apache incubator
> > [] +0 Don't care.
> > [] -1 Don't accept Tajo into the incubator because...
> >
> > Full proposal is pasted at the bottom of this email, and the corresponding
> > wiki is http://wiki.apache.org/incubator/TajoProposal.
> >
> > Only VOTEs from Incubator PMC members are binding, but all are welcome to
> > express their thoughts.
> >
> > Thanks,
> > Hyunsik
> >
> > PS: From the initial discussion, the main changes are that I've added 4 new
> > committers. Also, I've revised some description of Known Risks because the
> > initial committers have become diverse.
> >
> > ----------------
> > Tajo Proposal
> >
> > = Abstract =
> >
> > Tajo is a distributed data warehouse system for Hadoop.
> >
> >
> > = Proposal =
> >
> > Tajo is a relational and distributed data warehouse system for Hadoop. Tajo
> > is designed for low-latency and scalable ad-hoc queries, online aggregation
> > and ETL on large data sets by leveraging advanced database techniques. It
> > supports SQL standards. Tajo is inspired by Dryad, MapReduce, Dremel,
> > Scope, and parallel databases. Tajo uses HDFS as a primary storage layer,
> > and it has its own query engine which allows direct control of distributed
> > execution and data flow. As a result, Tajo has a variety of query
> > evaluation strategies and more optimization opportunities. In addition,
> > Tajo will have a native columnar execution engine and its own optimizer.
> > Tajo will be an alternative to Hive/Pig on top of MapReduce.
> >
> >
> > = Background =
> >
> > Big data analysis has gained much attention in industry.
> > Open source communities have proposed scalable and distributed solutions
> > for ad-hoc queries on big data. However, there is still room for
> > improvement. Markets need faster and more efficient solutions. Recently,
> > some alternatives (e.g., Cloudera's Impala and Amazon Redshift) have come
> > out.
> >
> >
> > = Rationale =
> >
> > There are a variety of open source distributed execution engines (e.g.,
> > Hive and Pig) running on top of MapReduce. They are limited by the MR
> > framework. They cannot directly control distributed execution and data
> > flow, and they just use the MR framework. So, they have limited query
> > evaluation strategies and optimization opportunities. It is hard for them
> > to be optimized for a certain type of data processing.
> >
> >
> > = Initial Goals =
> >
> > The initial goal is to write more documents to describe Tajo's internals.
> > It will be helpful to recruit more committers and to build a solid
> > community. Then, we will make milestones for short/long term plans.
> >
> >
> > = Current Status =
> >
> > Tajo is in the alpha stage. Users can execute usual SQL queries (e.g.,
> > selection, projection, group-by, join, union and sort) except for nested
> > queries. Tajo provides various row/column storage formats, such as CSV,
> > RowFile (a row-store file we have implemented), RCFile, and Trevni, and it
> > also has a rudimentary ETL feature to transform one data format to another
> > data format. In addition, Tajo provides hash and range repartitions. By
> > using both repartition methods, Tajo processes aggregation, join, and sort
> > queries over a number of cluster nodes. To evaluate the performance, we
> > have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.
> >
> >
> > == Meritocracy ==
> >
> > We will discuss the milestone and the future plan in an open forum. We plan
> > to encourage an environment that supports a meritocracy.
> > The contributors will have different privileges according to their
> > contributions.
> >
> >
> > == Community ==
> >
> > Big data analysis has gained attention from open source communities,
> > industry, and academia. Some projects related to Hadoop already have
> > very large and active communities. We expect that Tajo will also establish
> > an active community. Since Tajo already works for some features and is in
> > the alpha stage, it will attract a large community soon.
> >
> >
> > == Core Developers ==
> >
> > Core developers are a diverse group of developers, many of whom are very
> > experienced in open source and the Apache Hadoop ecosystem.
> >
> > * Eli Reisman <ereisman AT apache DOT org>
> >
> > * Henry Saputra <hsaputra AT apache DOT org>
> >
> > * Hyunsik Choi <hyunsik AT apache DOT org>
> >
> > * Jae Hwa Jung <jhjung AT gruter DOT com>
> >
> > * Jihoon Son <ghoonson AT gmail DOT com>
> >
> > * Jin Ho Kim <jhkim AT gruter DOT com>
> >
> > * Roshan Sumbaly <rsumbaly AT gmail DOT com>
> >
> > * Sangwook Kim <swkim AT inervit DOT com>
> >
> > * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
> >
> >
> > == Alignment ==
> >
> > Tajo employs Apache Hadoop YARN as a resource management platform for large
> > clusters. It uses HDFS as a primary storage layer. It already supports
> > Hadoop-related data formats (RCFile, Trevni) and will support ORC file. In
> > addition, we have a plan to integrate Tajo with other products of the
> > Hadoop ecosystem. Tajo's modules are well organized, and these modules can
> > also be used for other projects.
> >
> >
> > = Known Risks =
> >
> > == Orphaned Products ==
> >
> > Most of the code has been developed by only two core developers, Hyunsik
> > Choi and Jihoon Son. This may pose a risk of the project being orphaned.
> > However, they are guaranteed to have enough time to develop Tajo for years.
> > As you can see from the commit history, they have participated in this
> > project for about two years.
> > In addition, the initial committers are diverse, and Tajo
> > has been supported by two IT companies in South Korea. So, the risk of
> > being orphaned is very low. Later, we will be eager to recruit additional
> > committers in order to eliminate this risk.
> >
> >
> > == Inexperience with Open Source ==
> >
> > Most of the initial committers have experience working on open source
> > projects. In particular, Eli, Henry, and Hyunsik have experience as
> > committers and PMC members on other Apache projects.
> >
> >
> > == Homogeneous Developers ==
> >
> > Although they are a diverse group of developers, the fact that half of the
> > core developers are in South Korea may be a risk. This is because their
> > offline activities are limited due to their location. Since we recognize
> > this risk, we will write more complete documents and presentation materials
> > in order to disseminate Tajo's internals and user guide. In addition, to
> > mitigate this risk we will be eager to recruit additional committers around
> > the world.
> >
> >
> > == Reliance on Salaried Developers ==
> >
> > It is expected that Tajo development will occur on both salaried time and
> > on volunteer time. Hyunsik and Jihoon belong to Database lab., Korea Univ.
> > They will be paid by the lab to contribute to Tajo for years. Jin Ho and
> > Sangwook are paid by their employer to contribute to this project. Other
> > developers will contribute to this project on volunteer time. In addition,
> > we will be eager to recruit additional committers including salaried and
> > non-salaried developers.
> >
> >
> > == Relationships with Other Apache Products ==
> >
> > Tajo has some overlapping functionality with Apache Incubator Drill.
> > However, Tajo is even more mature than Drill. In addition, there are some
> > significant differences. Drill is a distributed system specialized for
> > low-latency query processing by using column operations and intermediate
> > data streaming.
> > Drill has a very simple query optimizer. However, some
> > queries including big-big table join and sort are not available in that
> > manner. Drill will support some query types.
> >
> > In contrast, Tajo has an advanced query optimization system. Tajo mainly
> > aims at scalable and efficient processing on all query types. By using the
> > query optimizer, Tajo will only chase low-latency query processing for some
> > query types that can be executed in an online aggregation manner.
> >
> > Besides, Tez has some overlapping functions with Tajo. However, Tez is in
> > the pre-alpha stage and may be a prototype. When Tez becomes feasible, Tajo
> > could use Tez as an underlying framework according to the applicability.
> > However, Tajo will still use its row/native columnar execution engine and
> > its optimizer. Tajo may potentially be the first application of Tez.
> >
> >
> > == An Excessive Fascination with the Apache Brand ==
> >
> > We believe that the Apache brand will help us to find contributors and to
> > grow the community. The community and development process will make this
> > project more stable and help establish ubiquitous APIs. In addition, Tajo
> > depends on other projects in the Apache Hadoop ecosystem. We expect that
> > cooperative work will occur with other projects in the same place.
> >
> >
> > = Documentation =
> >
> > Tajo's demonstration paper was accepted to IEEE ICDE 2013. Since this
> > conference will be held in April 2013, we cannot publicly show the paper.
> > Instead, we attached some presentation material. Check out these slides
> > (http://www.slideshare.net/hyunsikchoi/tajo-intro).
> >
> > In addition, some documents (e.g., getting started) are available at
> > http://tajo-project.github.com/tajo/.
> >
> >
> > = Initial Source =
> >
> > The initial source code has been developed in the Database Lab. Korea Univ.
> > It is implemented in Java and has almost 100,000 lines, excluding parser
> > and protobuf-generated code.
> > Currently, the initial source code is already
> > available on GitHub at [[https://github.com/tajo-project/tajo]].
> >
> >
> > = Source and Intellectual Property Submission Plan =
> >
> > We intend the entire code base to be licensed under the Apache License,
> > Version 2.0.
> >
> >
> > = External Dependencies =
> >
> > The required dependencies all have Apache-compatible licenses. The
> > following components with non-Apache licenses are enumerated:
> >
> > * Google Guava
> >
> > * Google Protocol Buffer
> >
> > * Antlr
> >
> > * Mockito
> >
> > * JLine2
> >
> >
> > = Cryptography =
> >
> > Tajo will depend on secure Hadoop that can optionally use Kerberos.
> >
> >
> > = Required Resources =
> >
> > == Mailing List ==
> >
> > * tajo-private (with moderated subscriptions)
> >
> > * tajo-dev
> >
> > * tajo-commits
> >
> >
> > == Subversion Directory ==
> >
> > https://git-wip-us.apache.org/repos/asf/tajo.git
> >
> >
> > == Issue Tracking ==
> >
> > Jira Tajo (TAJO)
> >
> >
> > == Other Resources ==
> >
> > * Continuous Integration
> >
> > * Jenkins
> >
> > * Wiki
> >
> > * http://wiki.apache.org/tajo
> >
> >
> > = Initial Committers =
> >
> > * Eli Reisman <ereisman AT apache DOT org>
> >
> > * Henry Saputra <hsaputra AT apache DOT org>
> >
> > * Hyunsik Choi <hyunsik AT apache DOT org>
> >
> > * Jae Hwa Jung <jhjung AT gruter DOT com>
> >
> > * Jihoon Son <ghoonson AT gmail DOT com>
> >
> > * Jin Ho Kim <jhkim AT gruter DOT com>
> >
> > * Roshan Sumbaly <rsumbaly AT gmail DOT com>
> >
> > * Sangwook Kim <swkim AT inervit DOT com>
> >
> > * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
> >
> >
> > = Affiliations =
> >
> > * Eli Reisman (Hortonworks)
> >
> > * Henry Saputra (Platfora)
> >
> > * Hyunsik Choi (Database Lab., Korea University)
> >
> > * Jae Hwa Jung (Gruter)
> >
> > * Jihoon Son (Database Lab., Korea University)
> >
> > * Jin Ho Kim (Gruter)
> >
> > * Roshan Sumbaly (LinkedIn)
> >
> > * Sangwook Kim (Inervit)
> >
> > * Yi A Liu (Intel)
> >
> >
> > The nominated mentors are employees of NASA JPL, LinkedIn, and
> > Hortonworks.
> >
> > * Chris Mattmann - NASA JPL
> >
> > * Jakob Homan - LinkedIn
> >
> > * Owen O'Malley - Hortonworks
> >
> >
> > = Sponsors =
> >
> > == Champion ==
> >
> > * Jakob Homan <ghoman AT apache DOT org>
> >
> >
> > == Nominated Mentors ==
> >
> > * Chris Mattmann <chris DOT a DOT mattmann AT jpl DOT nasa DOT gov>
> >
> > * Jakob Homan <jghoman AT apache DOT org>
> >
> > * Owen O'Malley <omalley AT apache DOT org>
> >
> >
> > == Sponsoring Entity ==
> >
> > Apache Incubator