I think that this vote should be invalidated. Are the five binding votes from IPMC members?
It is not 5. Also, should proposals with only 1 valid Mentor be acceptable? If two members of the IPMC step forward to Mentor MRQL I might change my mind. Regards, Dave On Mar 12, 2013, at 2:12 AM, Edward J. Yoon wrote: > The VOTE has passed with 5 binding +1's and no -1s. > > I'll start the work to get the podling started. > > Thank you. > > On Sun, Mar 10, 2013 at 3:48 AM, Mattmann, Chris A (388J) > <chris.a.mattm...@jpl.nasa.gov> wrote: >> +1 from me (binding). >> >> Good luck! >> >> Cheers, >> Chris >> >> >> On 3/6/13 9:04 AM, "Leonidas Fegaras" <fega...@cse.uta.edu> wrote: >> >>> Dear ASF members, >>> I would like to call for a VOTE for acceptance of MRQL into the >>> Incubator. >>> The vote will close on Monday March 11, 2013. >>> >>> [ ] +1 Accept MRQL into the Apache incubator >>> [ ] +0 Don't care. >>> [ ] -1 Don't accept MRQL into the incubator because... >>> >>> Full proposal is pasted below and the corresponding wiki is >>> >>> http://wiki.apache.org/incubator/MRQLProposal >>> >>> Only VOTEs from Incubator PMC members are binding, >>> but all are welcome to express their thoughts. >>> Sincerely, >>> Leonidas Fegaras >>> >>> >>> = Abstract = >>> >>> MRQL is a query processing and optimization system for large-scale, >>> distributed data analysis, built on top of Apache Hadoop and Hama. >>> >>> = Proposal = >>> >>> MRQL (pronounced ''miracle'') is a query processing and optimization >>> system for large-scale, distributed data analysis. MRQL (the MapReduce >>> Query Language) is an SQL-like query language for large-scale data >>> analysis on a cluster of computers. The MRQL query processing system >>> can evaluate MRQL queries in two modes: in MapReduce mode on top of >>> Apache Hadoop or in Bulk Synchronous Parallel (BSP) mode on top of >>> Apache Hama. The MRQL query language is powerful enough to express >>> most common data analysis tasks over many forms of raw ''in-situ'' >>> data, such as XML and JSON documents, binary files, and CSV >>> documents. MRQL is more powerful than other current high-level >>> MapReduce languages, such as Hive and PigLatin, since it can operate >>> on more complex data and supports more powerful query constructs, thus >>> eliminating the need for using explicit MapReduce code. With MRQL, >>> users will be able to express complex data analysis tasks, such as >>> PageRank, k-means clustering, matrix factorization, etc, using >>> SQL-like queries exclusively, while the MRQL query processing system >>> will be able to compile these queries to efficient Java code. >>> >>> = Background = >>> >>> The initial code was developed at the University of Texas of Arlington >>> (UTA) by a research team, led by Leonidas Fegaras. The software was >>> first released in May 2011. The original goal of this project was to >>> build a query processing system that translates SQL-like data analysis >>> queries to efficient workflows of MapReduce jobs. A design goal was to >>> use HDFS as the physical storage layer, without any indexing, data >>> partitioning, or data normalization, and to use Hadoop (without >>> extensions) as the run-time engine. The motivation behind this work >>> was to build a platform to test new ideas on query processing and >>> optimization techniques applicable to the MapReduce framework. >>> >>> A year ago, MRQL was extended to run on Hama. The motivation for this >>> extension was that Hadoop MapReduce jobs were required to read their >>> input and write their output on HDFS. This simplifies reliability and >>> fault tolerance but it imposes a high overhead to complex MapReduce >>> workflows and graph algorithms, such as PageRank, which require >>> repetitive jobs. In addition, Hadoop does not preserve data in memory >>> across consecutive MapReduce jobs. This restriction requires to read >>> data at every step, even when the data is constant. BSP, on the other >>> hand, does not suffer from this restriction, and, under certain >>> circumstances, allows complex repetitive algorithms to run entirely in >>> the collective memory of a cluster. Thus, the goal was to be able to >>> run the same MRQL queries in both modes, MapReduce and BSP, without >>> modifying the queries: If there are enough resources available, and >>> low latency and speed are more important than resilience, queries may >>> run in BSP mode; otherwise, the same queries may run in MapReduce >>> mode. BSP evaluation was found to be a good choice when fault >>> tolerance is not critical, data (both input and intermediate) can fit >>> in the cluster memory, and data processing requires complex/repetitive >>> steps. >>> >>> The research results of this ongoing work have already been published >>> in conferences (WebDB'11, EDBT'12, and DataCloud'12) and the authors >>> have already received positive feedback from researchers in academia >>> and industry who were attending these conferences. >>> >>> = Rationale = >>> >>> * MRQL will be the first general-purpose, SQL-like query language for >>> data analysis based on BSP. >>> Currently, many programmers prefer to code their MapReduce >>> applications in a higher-level query language, rather than an >>> algorithmic language. For instance, Pig is used for 60% of Yahoo >>> MapReduce jobs, while Hive is used for 90% of Facebook MapReduce >>> jobs. This, we believe, will also be the trend for BSP applications, >>> because, even though, in principle, the BSP model is very simple to >>> understand, it is hard to develop, optimize, and maintain non-trivial >>> BSP applications coded in a general-purpose programming >>> language. Currently, there is no widely acceptable declarative BSP >>> query language, although there are a few special-purpose BSP systems >>> for graph analysis, such as Google Pregel and Apache Giraph, for >>> machine learning, such as BSML, and for scientific data analysis. >>> >>> * MRQL can capture many complex data analysis algorithms in >>> declarative form. >>> Existing MapReduce query languages, such as HiveQL and PigLatin, >>> provide a limited syntax for operating on data collections, in the >>> form of relational joins and group-bys. Because of these limitations, >>> these languages enable users to plug-in custom MapReduce scripts into >>> their queries for those jobs that cannot be declaratively coded in >>> their query language. This nullifies the benefits of using a >>> declarative query language and may result to suboptimal, error-prone, >>> and hard-to-maintain code. More importantly, these languages are >>> inappropriate for complex scientific applications and graph analysis, >>> because they do not directly support iteration or recursion in >>> declarative form and are not able to handle complex, nested scientific >>> data, which are often semi-structured. Furthermore, current MapReduce >>> query processors apply traditional query optimization techniques that >>> may be suboptimal in a MapReduce or BSP environment. >>> >>> * The MRQL design is modular, with pluggable distributed processing >>> back-ends, query languages, and data formats. >>> MRQL aims to be both powerful and adaptable. Although Hadoop is >>> currently the most popular framework for large-scale data analysis, >>> there are a few alternatives that are currently shaping form, >>> including frameworks based on BSP (eg, Giraph, Pregel, Hama), MPI >>> (eg, OpenMPI), etc. MRQL was designed in such a way so that it will >>> be easy to support other distributed processing frameworks in the >>> future. As an evidence of this claim, the MRQL processor required >>> only 2K extra lines of Java code to support BSP evaluation. >>> >>> = Initial Goals = >>> >>> Some current goals include: >>> >>> * apply MRQL to graph analysis problems, such as k-means clustering >>> and PageRank >>> >>> * apply MRQL to large-scale scientific analysis (develop general >>> optimization techniques that can apply to matrix multiplication, >>> matrix factorization, etc) >>> >>> * process additional data formats, such as Avro, and column-based >>> stores, such as HBase >>> >>> * map MRQL to additional distributed processing frameworks, such as >>> Spark and OpenMPI >>> >>> * extend the front-end to process more query languages, such as >>> standard SQL, SPARQL, XQuery, and PigLatin >>> >>> = Current Status = >>> >>> The current MRQL release (version 0.8.10) is a beta release. It is >>> built on top of Hadoop and Hama (no extensions are needed). It >>> currently works on Hadoop up to 1.0.4 (but not on Yarn yet) and Hama >>> 0.5.0. It has only been tested on a small cluster of 20 nodes (80 >>> cores). >>> >>> == Meritocracy == >>> >>> The initial MRQL code base was developed by Leonidas Fegaras in May >>> 2011, and was continuously improved throughout the years. We will >>> reach out other potential contributors through open forums. We plan >>> to do everything possible to encourage an environment that supports a >>> meritocracy, where contributors will extend their privileges based on >>> their contribution. MRQL's modular design will facilitate the >>> strategic extensions to various modules, such as adding a standard-SQL >>> interface, introducing new optimization techniques, etc. >>> >>> == Community == >>> >>> The interest in open-source query processing systems for analyzing >>> large datasets has been steadily increased in the last few years. >>> Related Apache projects have already attracted a very large community >>> from both academia and industry. We expect that MRQL will also >>> establish an active community. Several researchers from both academia >>> and industry who are interested in using our code have already >>> contacted us. >>> >>> == Core Developers == >>> >>> The initial core developer was Leonidas Fegaras, who wrote the >>> majority of the code. He is an associate professor at UTA, with >>> interests in cloud computing, databases, web technologies, and >>> functional programming. He has an extensive knowledge and working >>> experience in building complex query processing systems for databases, >>> and compilers for functional and algorithmic programming languages. >>> >>> == Alignment == >>> >>> MRQL is built on top of two Apache projects: Hadoop and Hama. We have >>> plans to incorporate other products from the Hadoop ecosystem, such as >>> Avro and HBase. MRQL can serve as a testbed for fine-tuning and >>> evaluating the performance of the Apache Hama system. Finally, the >>> MRQL query language and processor can be used by Apache Drill as a >>> pluggable query language. >>> >>> = Known Risks = >>> >>> == Orphaned Products == >>> >>> The initial committer is from academia, which may be a risk, since >>> research in academia is publication-driven, rather than >>> product-driven. It happens very often in academic research, when a >>> project becomes outdated and doesn't produce publishable results, to >>> be abandoned in favor of new cutting-edge projects. We do not believe >>> that this will be the case for MRQL for the years to come, because it >>> can be adapted to support new query languages, new optimization >>> techniques, and new distributed back-ends, thus sustaining enough >>> research interest. Another risk is that, when graduate students who >>> write code graduate, they may leave their work undocumented and >>> unfinished. We will strive to gain enough momentum to recruit >>> additional committers from industry in order to eliminate these risks. >>> >>> == Inexperience with Open Source == >>> >>> The initial developer has been involved with various projects whose >>> source code has been released under open source license, but he has no >>> prior experience on contributing to open-source projects. With the >>> guidance from other more experienced committers and participants, we >>> expect that the meritocracy rules will have a positive influence on >>> this project. >>> >>> == Homogeneous Developers == >>> >>> The initial committer comes from academia. However, given the interest >>> we have seen in the project, we expect the diversity to improve in the >>> near future. >>> >>> == Reliance on Salaried Developers == >>> >>> Currently, the MRQL code was developed on the committer's volunteer >>> time. In the future, UTA graduate students who will do some of the >>> coding may be supported by UTA and funding agencies, such as NSF. >>> >>> == Relationships with Other Apache Products == >>> >>> MRQL has some overlapping functionality with Hive and Tajo, which are >>> Data Warehouse systems for Hadoop, and with Drill, which is an >>> interactive data analysis system that can process nested data. MRQL >>> has a more powerful data model, in which any form of nested data, such >>> as XML and JSON, can be defined as a user-defined datatype. More >>> importantly, complex data analysis tasks, such as PageRank, k-means >>> clustering, and matrix multiplication and factorization, can be >>> expressed as short SQL-like queries, while the MRQL system is able to >>> evaluate these queries efficiently. Furthermore, the MRQL system can >>> run these queries in BSP mode, in addition to MapReduce mode, thus >>> achieving low latency and speed, which are also Drill's goals. >>> Nevertheless, we will welcome and encourage any help from these >>> projects and we will be eager to make contributions to these projects >>> too. >>> >>> == An Excessive Fascination with the Apache Brand == >>> >>> The Apache brand is likely to help us find contributors and reach out >>> to the open-source community. Nevertheless, since MRQL depends on >>> Apache projects (Hadoop and Hama), it makes sense to have our software >>> available as part of this ecosystem. >>> >>> = Documentation = >>> >>> Information about MRQL can be found at http://lambda.uta.edu/mrql/ >>> >>> = Initial Source = >>> >>> The initial MRQL code has been released as part of a research project >>> developed at the University of Texas at Arlington under the Apache 2.0 >>> license for the past two years. The source code is currently hosted >>> on GitHub at: https://github.com/fegaras/mrql MRQL’s release artifact >>> would consist of a single tarball of packaging and test code. >>> >>> = External Dependencies = >>> >>> The MRQL source code is already licensed under the Apache License, >>> Version 2.0. MRQL uses JLine which is distributed under the BSD >>> license. >>> >>> = Cryptography = >>> >>> Not applicable. >>> >>> = Required Resources = >>> >>> == Mailing Lists == >>> >>> * mrql-private >>> * mrql-dev >>> * mrql-user >>> >>> == Subversion Directory == >>> >>> * Git is the preferred source control system: >>> git://git.apache.org/mrql >>> >>> == Issue Tracking == >>> >>> * A JIRA issue tracker, MRQL >>> >>> == Wiki == >>> >>> * Moinmoin wiki, http://wiki.apache.org/mrql >>> >>> = Initial Committers = >>> >>> * Leonidas Fegaras <fegaras AT cse DOT uta DOT edu> >>> * Upa Gupta <upa.gupta AT mavs DOT uta DOT edu> >>> * Edward J. Yoon <edwardyoon AT apache DOT org> >>> * Maqsood Alam <maqsoodalam AT hotmail DOT com> >>> * John Hope <john.hope AT oracle DOT com> >>> * Mark Wall <mark.wall AT oracle DOT com> >>> * Kuassi Mensah <kuassi.mensah AT oracle DOT com> >>> * Ambreesh Khanna <ambreesh.khanna AT oracle DOT com> >>> * Karthik Kambatla <kasha AT cloudera DOT com> >>> >>> = Affiliations = >>> >>> * Leonidas Fegaras (University of Texas at Arlington) >>> * Upa Gupta (University of Texas at Arlington) >>> * Edward J. Yoon (Oracle corp) >>> * Maqsood Alam (Oracle corp) >>> * John Hope (Oracle corp) >>> * Mark Wall (Oracle corp) >>> * Kuassi Mensah (Oracle corp) >>> * Ambreesh Khanna (Oracle corp) >>> * Karthik Kambatla (Cloudera) >>> >>> = Sponsors = >>> >>> == Champion == >>> >>> * Edward J. Yoon <edwardyoon AT apache DOT org> >>> >>> == Nominated Mentors == >>> >>> * Alex Karasulu <akarasulu AT apache DOT org> >>> * Edward J. Yoon <edwardyoon AT apache DOT org> >>> >>> == Sponsoring Entity == >>> >>> Incubator PMC >>> >> > > > > -- > Best Regards, Edward J. Yoon > @eddieyoon > > --------------------------------------------------------------------- > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org