Re: [VOTE] Accept CarbonData into the Apache Incubator

Luke Han Wed, 25 May 2016 22:44:22 -0700

+1 (binding)


Best Regards!
---------------------

Luke Han

On Wed, May 25, 2016 at 9:44 PM, Wang, Gang1 <gang1.w...@intel.com> wrote:

> +1 (non-binding)
>
> Best Regards
> +Gary.
>
> -----Original Message-----
> From: Cheng, Hao [mailto:hao.ch...@intel.com]
> Sent: Wednesday, May 25, 2016 7:09 PM
> To: general@incubator.apache.org
> Subject: RE: [VOTE] Accept CarbonData into the Apache Incubator
>
> +1
>
> -----Original Message-----
> From: Jacques Nadeau [mailto:jacq...@apache.org]
> Sent: Thursday, May 26, 2016 8:26 AM
> To: general@incubator.apache.org
> Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator
>
> +1 (binding)
>
> On Wed, May 25, 2016 at 4:04 PM, John D. Ament <johndam...@apache.org>
> wrote:
>
> > +1
> >
> > On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Hi all,
> > >
> > > following the discussion thread, I'm now calling a vote to accept
> > > CarbonData into the Incubator.
> > >
> > > > [ ] +1 Accept CarbonData into the Apache Incubator
> > > > [ ] +0 Abstain
> > > > [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> > >
> > > This vote is open for 72 hours.
> > >
> > > The proposal follows, you can also access the wiki page:
> > > https://wiki.apache.org/incubator/CarbonDataProposal
> > >
> > > Thanks !
> > > Regards
> > > JB
> > >
> > > = Apache CarbonData =
> > >
> > > == Abstract ==
> > >
> > > > Apache CarbonData is a new Apache Hadoop native file format for
> > > > faster interactive queries. It uses advanced columnar storage,
> > > > indexing, compression and encoding techniques to improve computing
> > > > efficiency, which in turn can speed up queries by an order of
> > > > magnitude over petabytes of data.
> > >
> > > CarbonData github address:
> > > https://github.com/HuaweiBigData/carbondata
> > >
> > > == Background ==
> > >
> > > > Huawei is an ICT solution provider. We are committed to enhancing
> > > > customer experiences for telecom carriers, enterprises, and
> > > > consumers on big data. To satisfy the following customer
> > > > requirements, we created a new Hadoop native file format:
> > >
> > > >   * Support interactive OLAP-style queries over big data in seconds.
> > > >   * Support fast queries on individual records, which require
> > > > touching all fields.
> > > >   * Support fast data loading and incremental loads within a period
> > > > of minutes.
> > > >   * Support HDFS so that customers can leverage existing Hadoop clusters.
> > > >   * Support time-based data retention.
> > >
> > > > Based on these requirements, we investigated existing file formats
> > > > in the Hadoop ecosystem, but we could not find a solution that
> > > > satisfies all of the requirements at the same time, so we started
> > > > designing CarbonData.
> > >
> > > == Rationale ==
> > >
> > > CarbonData contains multiple modules, which are classified into two
> > > categories:
> > >
> > > >   1. CarbonData File Format: contains the core implementation of
> > > > the file format, such as columnar storage, index, dictionary,
> > > > encoding and compression, and the API for reading/writing.
> > > >   2. CarbonData integration with big data processing frameworks such
> > > > as Apache Spark and Apache Hive. Apache Beam is also planned to
> > > > abstract the execution runtime.
> > >
> > > === CarbonData File Format ===
> > >
> > > > The CarbonData file format is a columnar store in HDFS. It has many
> > > > features that a modern columnar format has, such as being
> > > > splittable, compression schema, complex data types, etc. CarbonData
> > > > also has the following unique features:
> > >
> > > ==== Indexing ====
> > >
> > > > To support fast interactive queries, CarbonData leverages
> > > > indexing technology to reduce I/O scans. A CarbonData file stores
> > > > data along with its index: the index is not stored separately, the
> > > > CarbonData file itself contains it. In the current
> > > > implementation, CarbonData supports 3 types of indexing:
> > >
> > > > 1. Multi-dimensional Key (B+ Tree index)
> > > >   The data blocks are written in sequence to the disk, and within
> > > > each data block each column block is written in sequence. Finally,
> > > > the metadata block for the file is written with information about
> > > > the byte positions of each block in the file, the Min-Max
> > > > statistics index, and the start and end MDK (multi-dimensional
> > > > key) of each data block. Since the entire data in the file is in
> > > > sorted order, the start and end MDK of each data block can be used
> > > > to construct a B+Tree, and the file can be logically represented
> > > > as a B+Tree with the data blocks as leaf nodes (on disk) and the
> > > > remaining non-leaf nodes in memory.
> > > > 2. Inverted index
> > > >   Inverted indexes are widely used in search engines. This index
> > > > helps the processing/query engine do filtering inside one HDFS
> > > > block. Furthermore, operations such as count distinct can be
> > > > accelerated by combining bitmaps with the inverted index at query
> > > > time.
> > > > 3. MinMax index
> > > >   A minmax index is created for all columns so that the
> > > > processing/query engine can skip scans that are not required.
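
A minimal sketch of how the per-block metadata described above could be used to avoid reading blocks. It assumes blocks are sorted by key and that no key spans two blocks. The `Block` class, its field names, and both functions are hypothetical illustrations, not CarbonData's actual format or API. A binary search over the block start keys stands in for the in-memory non-leaf nodes of the B+Tree.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical per-block metadata, as it might sit in a file footer."""
    start_key: tuple   # first multi-dimensional key (MDK) in the block
    end_key: tuple     # last MDK in the block
    col_min: dict      # column name -> minimum value in the block
    col_max: dict      # column name -> maximum value in the block


def blocks_for_key(blocks, key):
    """Return indexes of blocks whose [start_key, end_key] may contain key.

    Blocks are sorted by key, so a binary search over the start keys
    plays the role of descending the non-leaf nodes of a B+Tree.
    """
    starts = [b.start_key for b in blocks]
    i = bisect_right(starts, key) - 1
    if i >= 0 and blocks[i].start_key <= key <= blocks[i].end_key:
        return [i]
    return []


def blocks_for_range(blocks, column, low, high):
    """Min-max pruning: keep blocks whose value range overlaps [low, high]."""
    return [
        i for i, b in enumerate(blocks)
        if b.col_max[column] >= low and b.col_min[column] <= high
    ]


# Three sorted blocks keyed by (country, day), with price statistics.
blocks = [
    Block(("cn", 1), ("cn", 9), {"price": 5}, {"price": 40}),
    Block(("de", 1), ("de", 7), {"price": 35}, {"price": 90}),
    Block(("us", 2), ("us", 8), {"price": 100}, {"price": 250}),
]

point_hits = blocks_for_key(blocks, ("de", 3))
range_hits = blocks_for_range(blocks, "price", 30, 50)
print(point_hits, range_hits)
```

In this toy data the point lookup touches a single block, and the range filter on `price` skips the third block without reading it.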
> > >
> > > ==== Global Dictionary ====
> > >
> > > > Besides I/O reduction, CarbonData accelerates computation by using
> > > > a global dictionary, which enables processing/query engines to
> > > > perform all processing on encoded data without having to convert
> > > > the data (Late Materialization). We have observed dramatic
> > > > performance improvement in OLAP analytic scenarios where a table
> > > > contains many columns of string data type. The data is converted
> > > > back to user-readable form just before the processing/query engine
> > > > returns results to the user.
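
A minimal sketch of dictionary encoding with late materialization. It assumes a single in-memory dictionary that maps each distinct string to an integer code. The function and variable names are hypothetical and do not reflect CarbonData's implementation. Filtering and aggregation run on the integer codes, and strings are restored only when the result is produced.

```python
from collections import Counter


def build_dictionary(values):
    """Assign each distinct string an integer code (sorted for stability)."""
    return {v: code for code, v in enumerate(sorted(set(values)))}


city = ["paris", "tokyo", "paris", "lima", "tokyo", "paris"]
dictionary = build_dictionary(city)                      # value -> code
reverse = {code: v for v, code in dictionary.items()}    # code -> value

# The column is stored as integer codes instead of strings.
encoded = [dictionary[v] for v in city]

# Filter and aggregate on integer codes only.
wanted = dictionary["paris"]
paris_rows = [i for i, code in enumerate(encoded) if code == wanted]
counts_by_code = Counter(encoded)

# Late materialization: decode just before returning results.
result = {reverse[code]: n for code, n in counts_by_code.items()}
print(encoded, paris_rows, result)
```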
> > >
> > > ==== Column Group ====
> > >
> > > > Sometimes users want to query multiple columns of one table
> > > > together, for example when scanning for an individual record in a
> > > > troubleshooting scenario. In this case, row format is more efficient
> > > > than columnar format since all columns will be touched by the workload.
> > > > To accelerate this, CarbonData supports storing a group of columns in
> > > > row format, so data in a column group is stored together, which
> > > > enables fast retrieval.
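
A minimal sketch of the layout difference, using plain Python lists as stand-ins for on-disk storage. The table contents and helper names are hypothetical. Fetching one record from the columnar layout touches every column, while a column group returns the grouped fields in a single lookup.

```python
# Columnar layout: one list per column.
columnar = {
    "id":     [101, 102, 103],
    "status": ["ok", "fail", "ok"],
    "host":   ["a1", "b7", "c3"],
    "bytes":  [512, 2048, 128],
}


def fetch_record_columnar(table, row):
    """Fetching one full record touches every column's storage."""
    return {name: values[row] for name, values in table.items()}


def to_column_group(table, names):
    """Store the chosen columns together, one tuple per row (row format)."""
    return list(zip(*(table[n] for n in names)))


def fetch_record_grouped(group, names, row):
    """One lookup returns all grouped columns of the record."""
    return dict(zip(names, group[row]))


group_names = ("id", "status", "host")
group = to_column_group(columnar, group_names)

record = fetch_record_grouped(group, group_names, 1)
print(record)
```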
> > >
> > > ==== Optimized for multiple use cases ====
> > >
> > > > CarbonData indices and dictionaries are highly configurable. To
> > > > optimize storage for different use cases, users can configure what
> > > > to index and tune the format before loading data into CarbonData.
> > >
> > > For example
> > >
> > > > || Use Case || Supporting Features ||
> > > > || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> > > > || High throughput scan || Global dictionary, Minmax index ||
> > > > || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> > > > || Individual record query || Column group, Global dictionary ||
> > >
> > > === BigData Processing Framework Integration ===
> > >
> > > >   * CarbonData provides InputFormat/OutputFormat interfaces for
> > > > reading/writing data from CarbonData files, and at the same time
> > > > provides an abstract API for processing data stored in CarbonData
> > > > format with data processing frameworks.
> > > >   * CarbonData provides deep integration with Apache Spark, including
> > > > predicate push down, column pruning, aggregation push down, etc., so
> > > > users can use Spark SQL to connect to and query CarbonData.
> > > >   * CarbonData can integrate with various big data query/processing
> > > > frameworks in the Hadoop ecosystem, such as Apache Spark and Apache Hive.
> > > Example:
> > >
> > >
> > > > https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
> > >
> > > == Initial Goals ==
> > >
> > > Our initial goals are to bring CarbonData into the ASF, transition
> > > internal engineering processes into the open, and foster a
> > > collaborative development model according to the "Apache Way".
> > >
> > > == Current Status ==
> > >
> > > > CarbonData is production ready and already provides a large set of
> > > > features.
> > > > The current license is already Apache 2.0.
> > >
> > > == Meritocracy ==
> > >
> > > We intend to radically expand the initial developer and user
> > > community by running the project in accordance with the "Apache
> > > Way". Users and new contributors will be treated with respect and
> > > welcomed. By participating in the community and providing quality
> > > patches/support that move the project forward, they will earn merit.
> > > They also will be encouraged to provide non-code contributions
> > > (documentation, events, community management, etc.) and will gain
> > > merit for doing so. Those with a proven support and quality track
> > > record will be encouraged to become committers.
> > >
> > > == Community ==
> > >
> > > If CarbonData is accepted for incubation, the primary initial goal
> > > is to build a large community. We really trust that CarbonData will
> > > become a key project for big data column-like platforms, and so, we
> > > bet on a large community of users and developers.
> > >
> > > == Known Risks ==
> > >
> > > > Development has been sponsored mostly by one company. For the
> > > project to fully transition to the Apache Way governance model,
> > > development must shift towards the meritocracy-centric model of
> > > growing a community of contributors balanced with the needs for
> > > extreme stability and core implementation coherency.
> > >
> > > == Orphaned products ==
> > >
> > > > Huawei is fully committed to CarbonData. Moreover, Huawei has a
> > > > vested interest in making CarbonData succeed by driving its close
> > > > integration with sister ASF projects. We expect this to further
> > > > reduce the risk of orphaning the product.
> > >
> > > == Inexperience with Open Source ==
> > >
> > > > Huawei has been developing and using open source software for a
> > > > long time. Additionally, several ASF veterans agreed to mentor the
> > > project and are listed in this proposal. The project will rely on
> > > their guidance and collective wisdom to quickly transition the
> > > entire team of initial committers towards practicing the Apache Way.
> > >
> > > == Reliance on Salaried Developers ==
> > >
> > > > Most of the contributors are paid to work in the big data space.
> > > > While they might wander from their current employers, they are
> > > > unlikely to venture far from their core expertise and thus will
> > > > continue to be engaged with the project regardless of their
> > > > current employers.
> > >
> > > == An Excessive Fascination with the Apache Brand ==
> > >
> > > > While we intend to leverage the Apache ‘branding’ when talking to
> > > > other projects as a testament to our project’s ‘neutrality’, we have
> > > > no plans to use the Apache brand in press releases or to post
> > > > billboards advertising the acceptance of CarbonData into the Apache
> > > > Incubator.
> > >
> > > == Initial Source ==
> > >
> > > https://github.com/HuaweiBigData/carbondata.git
> > >
> > > == External Dependencies ==
> > >
> > > > All external dependencies are licensed under an Apache 2.0 license
> > > > or an Apache-compatible license. As we grow the CarbonData community
> > > > we will configure our build process to require and validate that all
> > > > contributions and dependencies are licensed under the Apache 2.0
> > > > license or an Apache-compatible license.
> > >
> > >   * Apache Spark
> > >   * Apache Hadoop
> > >   * Apache Maven
> > >   * Apache Commons
> > >   * Apache Log4j
> > >   * Apache Thrift
> > >   * Apache Zookeeper
> > >   * Scala
> > >   * Snappy
> > >   * Kettle (Pentaho)
> > >   * Eigenbase
> > >   * Fastutil
> > >   * GSON
> > >   * Jmockit
> > >   * Junit
> > >
> > > == Required Resources ==
> > >
> > > === Mailing lists ===
> > >
> > >   * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> > >   * comm...@carbondata.incubator.apache.org
> > >   * d...@carbondata.incubator.apache.org
> > >   * iss...@carbondata.incubator.apache.org
> > >
> > > === Git Repository ===
> > >
> > >   * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> > >
> > > === Issue Tracking ===
> > >
> > >   * JIRA Project CarbonData (CarbonData)
> > >
> > > === Initial Committers ===
> > >
> > >   * Liang Chenliang
> > >   * Jean-Baptiste Onofré
> > >   * Henry Saputra
> > >   * Uma Maheswara Rao G
> > >   * Jenny MA
> > >   * Jacky Likun
> > >   * Vimal Das Kammath
> > >   * Jarray Qiuheng
> > >
> > > === Affiliations ===
> > >
> > >   * Huawei: Liang Chenliang
> > >   * Talend: Jean-Baptiste Onofré
> > >   * Ebay: Henry Saputra
> > >   * Intel: Uma Maheswara Rao G
> > >
> > > === Sponsors ===
> > >
> > > === Champion ===
> > >
> > >   * Jean-Baptiste Onofré - Apache Member
> > >
> > > === Mentors ===
> > >
> > >   * Henry Saputra (eBay)
> > >   * Jean-Baptiste Onofré (Talend)
> > >   * Uma Maheswara Rao G (Intel)
> > >
> > > === Sponsoring Entity ===
> > >
> > > The Apache Incubator
> > >
> > > --------------------------------------------------------------------
> > > - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > > For additional commands, e-mail: general-h...@incubator.apache.org
> > >
> > >
> >
>
