+1 (binding)
Best Regards!
---------------------
Luke Han

On Wed, May 25, 2016 at 9:44 PM, Wang, Gang1 <gang1.w...@intel.com> wrote:

> +1 (non-binding)
>
> Best Regards
> Gary.
>
> -----Original Message-----
> From: Cheng, Hao [mailto:hao.ch...@intel.com]
> Sent: Wednesday, May 25, 2016 7:09 PM
> To: general@incubator.apache.org
> Subject: RE: [VOTE] Accept CarbonData into the Apache Incubator
>
> +1
>
> -----Original Message-----
> From: Jacques Nadeau [mailto:jacq...@apache.org]
> Sent: Thursday, May 26, 2016 8:26 AM
> To: general@incubator.apache.org
> Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator
>
> +1 (binding)
>
> On Wed, May 25, 2016 at 4:04 PM, John D. Ament <johndam...@apache.org>
> wrote:
>
> > +1
> >
> > On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Hi all,
> > >
> > > following the discussion thread, I'm now calling a vote to accept
> > > CarbonData into the Incubator.
> > >
> > > [ ] +1 Accept CarbonData into the Apache Incubator
> > > [ ] +0 Abstain
> > > [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> > >
> > > This vote is open for 72 hours.
> > >
> > > The proposal follows, you can also access the wiki page:
> > > https://wiki.apache.org/incubator/CarbonDataProposal
> > >
> > > Thanks !
> > > Regards
> > > JB
> > >
> > > = Apache CarbonData =
> > >
> > > == Abstract ==
> > >
> > > Apache CarbonData is a new Apache Hadoop native file format for
> > > faster interactive queries. It uses advanced columnar storage,
> > > indexing, compression and encoding techniques to improve computing
> > > efficiency, which in turn helps speed up queries by an order of
> > > magnitude over petabytes of data.
> > >
> > > CarbonData github address:
> > > https://github.com/HuaweiBigData/carbondata
> > >
> > > == Background ==
> > >
> > > Huawei is an ICT solution provider committed to enhancing
> > > customer experiences for telecom carriers, enterprises, and
> > > consumers on big data. To satisfy the following customer
> > > requirements, we created a new Hadoop native file format:
> > >
> > > * Support interactive OLAP-style queries over big data in seconds.
> > > * Support fast queries on individual records, which require
> > > touching all fields.
> > > * Fast data loading, with support for incremental loads at
> > > intervals of minutes.
> > > * Support HDFS so that customers can leverage existing Hadoop
> > > clusters.
> > > * Support time-based data retention.
> > >
> > > Based on these requirements, we investigated existing file formats
> > > in the Hadoop ecosystem, but could not find a solution that
> > > satisfies all of the requirements at the same time, so we started
> > > designing CarbonData.
> > >
> > > == Rationale ==
> > >
> > > CarbonData contains multiple modules, which are classified into two
> > > categories:
> > >
> > > 1. CarbonData File Format: contains the core implementation of the
> > > file format, such as columnar storage, index, dictionary, encoding
> > > and compression, and the API for reading/writing.
> > > 2. CarbonData integration with big data processing frameworks such
> > > as Apache Spark and Apache Hive. Support for Apache Beam is also
> > > planned, to abstract the execution runtime.
> > >
> > > === CarbonData File Format ===
> > >
> > > The CarbonData file format is a columnar store in HDFS. It has many
> > > of the features of a modern columnar format, such as splittable
> > > files, compression schema, complex data types etc. In addition,
> > > CarbonData has the following unique features:
> > >
> > > ==== Indexing ====
> > >
> > > In order to support fast interactive queries, CarbonData leverages
> > > indexing technology to reduce I/O scans.
> > > CarbonData files store data along with the index: the index is not
> > > stored separately, but is contained in the CarbonData file itself.
> > > In the current implementation, CarbonData supports 3 types of
> > > indexing:
> > >
> > > 1. Multi-dimensional Key (B+ Tree index)
> > > The data blocks are written in sequence to the disk, and within
> > > each data block each column block is written in sequence. Finally,
> > > the metadata block for the file is written with information about
> > > the byte positions of each block in the file, the Min-Max
> > > statistics index and the start and end MDK of each data block.
> > > Since the entire data in the file is in sorted order, the start and
> > > end MDK of each data block can be used to construct a B+Tree, and
> > > the file can be logically represented as a B+Tree with the data
> > > blocks as leaf nodes (on disk) and the remaining non-leaf nodes in
> > > memory.
> > > 2. Inverted index
> > > Inverted indexes are widely used in search engines. This index
> > > helps the processing/query engine to do filtering inside one HDFS
> > > block. Furthermore, query acceleration for count-distinct-like
> > > operations is made possible by combining bitmap and inverted index
> > > at query time.
> > > 3. MinMax index
> > > A minmax index is created for all columns so that the
> > > processing/query engine can skip scans that are not required.
> > >
> > > ==== Global Dictionary ====
> > >
> > > Besides I/O reduction, CarbonData accelerates computation by using
> > > a global dictionary, which enables processing/query engines to
> > > perform all processing on encoded data without having to convert
> > > the data (Late Materialization). We have observed dramatic
> > > performance improvements for OLAP analytic scenarios where a table
> > > contains many columns of string data type. The data is converted
> > > back to the user-readable form just before the processing/query
> > > engine returns results to the user.
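
The global dictionary idea above can be illustrated with a small, self-contained Python sketch. This is not CarbonData code: the function names (`build_dictionary`, `encode`, `count_by_code`, `materialize`) and the sample data are hypothetical, and the sketch only assumes that each distinct string is mapped to an integer code, that aggregation runs on the codes, and that decoding happens once on the final result.

```python
from collections import Counter


def build_dictionary(values):
    """Map each distinct string to an integer code (sorted for determinism)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace every string in the column by its dictionary code."""
    return [dictionary[value] for value in values]


def count_by_code(codes):
    """Group-by/count performed purely on integer codes."""
    return Counter(codes)


def materialize(code_counts, dictionary):
    """Late materialization: decode only the (small) final result."""
    reverse = {code: value for value, code in dictionary.items()}
    return {reverse[code]: count for code, count in code_counts.items()}


# Hypothetical string column.
cities = ["Shenzhen", "Bangalore", "Shenzhen", "Paris", "Bangalore", "Shenzhen"]

dictionary = build_dictionary(cities)
codes = encode(cities, dictionary)
result = materialize(count_by_code(codes), dictionary)
print(codes)
print(result)
```

In this sketch the strings are touched only twice, once when encoding at load time and once when decoding the three result rows; the aggregation itself compares integers.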
> > >
> > > ==== Column Group ====
> > >
> > > Sometimes users want to perform processing/queries on multiple
> > > columns in one table, for example, scanning individual records in
> > > a troubleshooting scenario. In this case, a row format is more
> > > efficient than a columnar format since all columns will be touched
> > > by the workload. To accelerate this, CarbonData supports storing a
> > > group of columns in row format, so data in a column group is stored
> > > together, enabling fast retrieval.
> > >
> > > ==== Optimized for multiple use cases ====
> > >
> > > CarbonData indices and dictionary are highly configurable. To
> > > optimize storage for different use cases, users can configure what
> > > to index, so they can decide on and tune the format before loading
> > > data into CarbonData.
> > >
> > > For example
> > >
> > > || Use Case || Supporting Features ||
> > > || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> > > || High throughput scan || Global dictionary, Minmax index ||
> > > || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> > > || Individual record query || Column group, Global dictionary ||
> > >
> > > === BigData Processing Framework Integration ===
> > >
> > > * CarbonData provides InputFormat/OutputFormat interfaces for
> > > reading/writing data from CarbonData files, and at the same time
> > > provides an abstract API for processing data stored in the
> > > CarbonData format with data processing frameworks.
> > > * CarbonData provides deep integration with Apache Spark, including
> > > predicate push down, column pruning, aggregation push down etc., so
> > > users can use Spark SQL to connect to and query CarbonData.
> > > * CarbonData can integrate with various big data query/processing
> > > frameworks in the Hadoop ecosystem, such as Apache Spark and
> > > Apache Hive.
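
To make the interplay between predicate push down and the MinMax index more concrete, here is a minimal Python sketch. It is not the CarbonData or Spark API: the `Block` class and `scan_greater_than` function are hypothetical, and the sketch only assumes that each data block records the minimum and maximum of a column so that a pushed-down filter can skip blocks that cannot contain matches.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A hypothetical data block holding one integer column."""
    values: List[int]

    def __post_init__(self):
        # Min/max statistics are computed once, when the block is written.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(blocks, threshold):
    """Evaluate `value > threshold`, skipping blocks ruled out by min/max."""
    matches = []
    blocks_read = 0
    for block in blocks:
        if block.max_value <= threshold:
            # No value in this block can satisfy the predicate: skip the scan.
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


blocks = [Block([1, 5, 9]), Block([12, 18, 20]), Block([25, 30, 41])]
matches, blocks_read = scan_greater_than(blocks, 19)
print(matches, blocks_read)
```

Only two of the three blocks are read here, because the first block's maximum already shows that it holds no matching value.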
> > >
> > > Example:
> > >
> > > https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
> > >
> > > == Initial Goals ==
> > >
> > > Our initial goals are to bring CarbonData into the ASF, transition
> > > internal engineering processes into the open, and foster a
> > > collaborative development model according to the "Apache Way".
> > >
> > > == Current Status ==
> > >
> > > CarbonData is production ready and already provides a large set of
> > > features. The current license is already Apache 2.0.
> > >
> > > == Meritocracy ==
> > >
> > > We intend to radically expand the initial developer and user
> > > community by running the project in accordance with the "Apache
> > > Way". Users and new contributors will be treated with respect and
> > > welcomed. By participating in the community and providing quality
> > > patches/support that move the project forward, they will earn merit.
> > > They will also be encouraged to provide non-code contributions
> > > (documentation, events, community management, etc.) and will gain
> > > merit for doing so. Those with a proven support and quality track
> > > record will be encouraged to become committers.
> > >
> > > == Community ==
> > >
> > > If CarbonData is accepted for incubation, the primary initial goal
> > > is to build a large community. We firmly believe that CarbonData
> > > will become a key project for big data column-like platforms, and
> > > so we are betting on a large community of users and developers.
> > >
> > > == Known Risks ==
> > >
> > > Development has been sponsored mostly by one company. For the
> > > project to fully transition to the Apache Way governance model,
> > > development must shift towards the meritocracy-centric model of
> > > growing a community of contributors balanced with the needs for
> > > extreme stability and core implementation coherency.
> > >
> > > == Orphaned products ==
> > >
> > > Huawei is fully committed to CarbonData. Moreover, Huawei has a
> > > vested interest in making CarbonData succeed by driving its close
> > > integration with sister ASF projects. We expect this to further
> > > reduce the risk of orphaning the product.
> > >
> > > == Inexperience with Open Source ==
> > >
> > > Huawei has been developing and using open source software for a
> > > long time. Additionally, several ASF veterans have agreed to mentor
> > > the project and are listed in this proposal. The project will rely
> > > on their guidance and collective wisdom to quickly transition the
> > > entire team of initial committers towards practicing the Apache Way.
> > >
> > > == Reliance on Salaried Developers ==
> > >
> > > Most of the contributors are paid to work in the big data space.
> > > While they might wander from their current employers, they are
> > > unlikely to venture far from their core expertise and thus will
> > > continue to be engaged with the project regardless of their current
> > > employers.
> > >
> > > == An Excessive Fascination with the Apache Brand ==
> > >
> > > While we intend to leverage the Apache ‘branding’ when talking to
> > > other projects as testament of our project’s ‘neutrality’, we have
> > > no plans for making use of the Apache brand in press releases nor
> > > posting billboards advertising acceptance of CarbonData into the
> > > Apache Incubator.
> > >
> > > == Initial Source ==
> > >
> > > https://github.com/HuaweiBigData/carbondata.git
> > >
> > > == External Dependencies ==
> > >
> > > All external dependencies are licensed under an Apache 2.0 license
> > > or Apache-compatible license. As we grow the CarbonData community we
> > > will configure our build process to require and validate that all
> > > contributions and dependencies are licensed under the Apache 2.0
> > > license or are under an Apache-compatible license.
> > >
> > > * Apache Spark
> > > * Apache Hadoop
> > > * Apache Maven
> > > * Apache Commons
> > > * Apache Log4j
> > > * Apache Thrift
> > > * Apache Zookeeper
> > > * Scala
> > > * Snappy
> > > * Kettle (Pentaho)
> > > * Eigenbase
> > > * Fastutil
> > > * GSON
> > > * Jmockit
> > > * Junit
> > >
> > > == Required Resources ==
> > >
> > > === Mailing lists ===
> > >
> > > * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> > > * comm...@carbondata.incubator.apache.org
> > > * d...@carbondata.incubator.apache.org
> > > * iss...@carbondata.incubator.apache.org
> > >
> > > === Git Repository ===
> > >
> > > * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> > >
> > > === Issue Tracking ===
> > >
> > > * JIRA Project CarbonData (CarbonData)
> > >
> > > === Initial Committers ===
> > >
> > > * Liang Chenliang
> > > * Jean-Baptiste Onofré
> > > * Henry Saputra
> > > * Uma Maheswara Rao G
> > > * Jenny MA
> > > * Jacky Likun
> > > * Vimal Das Kammath
> > > * Jarray Qiuheng
> > >
> > > === Affiliations ===
> > >
> > > * Huawei: Liang Chenliang
> > > * Talend: Jean-Baptiste Onofré
> > > * eBay: Henry Saputra
> > > * Intel: Uma Maheswara Rao G
> > >
> > > === Sponsors ===
> > >
> > > === Champion ===
> > >
> > > * Jean-Baptiste Onofré - Apache Member
> > >
> > > === Mentors ===
> > >
> > > * Henry Saputra (eBay)
> > > * Jean-Baptiste Onofré (Talend)
> > > * Uma Maheswara Rao G (Intel)
> > >
> > > === Sponsoring Entity ===
> > >
> > > The Apache Incubator
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > > For additional commands, e-mail: general-h...@incubator.apache.org