Re: [VOTE] Accept CarbonData into the Apache Incubator

Jim Jagielski Fri, 27 May 2016 05:54:07 -0700

Thx for the feedback...

I change my vote to +1 (binding)
> On May 27, 2016, at 1:46 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> 
> Hi Jim,
> 
> Good point. Let me try to explain this "gap" based on my discussion with the 
> team:
> 
> 1. Some people have been involved mostly in architecture and design rather 
> than directly in code. That's why they are part of the initial committer list, 
> even though they didn't really provide "visible" code on GitHub.
> 
> 2. Some people are no longer involved in the project. That's why they don't 
> appear on the initial committer list.
> 
> Regards
> JB
> 
> On 05/26/2016 05:45 PM, Jim Jagielski wrote:
>> I am trying to align the list of initial committers with
>> the list of current/active contributors, according to
>> Github, and I am seeing people proposed who have not
>> contributed anything and people NOT proposed who seem
>> to be kinda active...
>> 
>> Sooo..... -0
>> 
>>> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>> 
>>> Hi all,
>>> 
>>> following the discussion thread, I'm now calling a vote to accept 
>>> CarbonData into the Incubator.
>>> 
>>> [ ] +1 Accept CarbonData into the Apache Incubator
>>> [ ] +0 Abstain
>>> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>>> 
>>> This vote is open for 72 hours.
>>> 
>>> The proposal follows, you can also access the wiki page:
>>> https://wiki.apache.org/incubator/CarbonDataProposal
>>> 
>>> Thanks !
>>> Regards
>>> JB
>>> 
>>> = Apache CarbonData =
>>> 
>>> == Abstract ==
>>> 
>>> Apache CarbonData is a new Apache Hadoop native file format for faster 
>>> interactive queries. It uses advanced columnar storage, indexing, 
>>> compression, and encoding techniques to improve computing efficiency, 
>>> which in turn helps speed up queries by an order of magnitude over 
>>> petabytes of data.
>>> 
>>> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>>> 
>>> == Background ==
>>> 
>>> Huawei is an ICT solution provider. We are committed to enhancing customer 
>>> experiences for telecom carriers, enterprises, and consumers on big data. 
>>> In order to satisfy the following customer requirements, we created a new 
>>> Hadoop native file format:
>>> 
>>> * Support interactive OLAP-style queries over big data in seconds.
>>> * Support fast queries on individual records, which require touching all fields.
>>> * Support fast data loading and incremental loads within minutes.
>>> * Support HDFS so that customers can leverage existing Hadoop clusters.
>>> * Support time-based data retention.
>>> 
>>> Based on these requirements, we investigated existing file formats in the 
>>> Hadoop ecosystem, but we could not find a suitable solution that 
>>> satisfies all the requirements at the same time, so we started designing 
>>> CarbonData.
>>> 
>>> == Rationale ==
>>> 
>>> CarbonData contains multiple modules, which are classified into two 
>>> categories:
>>> 
>>> 1. CarbonData File Format: contains the core implementation of the file 
>>> format, such as columnar storage, index, dictionary, encoding and 
>>> compression, and the API for reading/writing.
>>> 2. CarbonData integration with big data processing frameworks such as Apache 
>>> Spark and Apache Hive. Apache Beam integration is also planned, to abstract 
>>> the execution runtime.
>>> 
>>> === CarbonData File Format ===
>>> 
>>> The CarbonData file format is a columnar store in HDFS. It has many of the 
>>> features of a modern columnar format, such as splittability, compression 
>>> schemes, and complex data types. CarbonData also has the following unique 
>>> features:
>>> 
>>> ==== Indexing ====
>>> 
>>> In order to support fast interactive queries, CarbonData leverages indexing 
>>> technology to reduce I/O scans. CarbonData files store data along with the 
>>> index: the index is not stored separately, but the CarbonData file itself 
>>> contains the index. In the current implementation, CarbonData supports 3 
>>> types of indexing:
>>> 
>>> 1. Multi-dimensional Key (B+ Tree index)
>>> The data blocks are written in sequence to the disk, and within each data 
>>> block each column block is written in sequence. Finally, the metadata 
>>> block for the file is written with information about the byte positions of 
>>> each block in the file, the Min-Max statistics index, and the start and end 
>>> MDK (multi-dimensional key) of each data block. Since the entire data in 
>>> the file is in sorted order, the start and end MDK of each data block can 
>>> be used to construct a B+Tree, and the file can be logically represented as 
>>> a B+Tree with the data blocks as leaf nodes (on disk) and the remaining 
>>> non-leaf nodes in memory.
>>> 2. Inverted index
>>> Inverted indexes are widely used in search engines. This index helps the 
>>> processing/query engine to do filtering inside one HDFS block. 
>>> Furthermore, query acceleration for count-distinct-like operations is made 
>>> possible by combining bitmaps and the inverted index at query time.
>>> 3. MinMax index
>>> For all columns, a MinMax index is created so that the processing/query 
>>> engine can skip scans that are not required.
>>> 
>>> ==== Global Dictionary ====
>>> 
>>> Besides I/O reduction, CarbonData accelerates computation by using a global 
>>> dictionary, which enables processing/query engines to perform all 
>>> processing on encoded data without having to convert the data (Late 
>>> Materialization). We have observed dramatic performance improvements for 
>>> OLAP analytic scenarios where the table contains many columns of string 
>>> data type. The data is converted back to user-readable form just before 
>>> the processing/query engine returns results to the user.
>>> 
>>> ==== Column Group ====
>>> 
>>> Sometimes users want to perform processing/queries on multiple columns in 
>>> one table, for example, scanning for an individual record in a 
>>> troubleshooting scenario. In this case, row format is more efficient than 
>>> columnar format since all columns will be touched by the workload. To 
>>> accelerate this, CarbonData supports storing a group of columns in row 
>>> format, so data in a column group is stored together, enabling fast 
>>> retrieval.
>>> 
>>> ==== Optimized for multiple use cases ====
>>> 
>>> CarbonData indices and dictionaries are highly configurable. To optimize 
>>> storage for different use cases, users can configure what to index, and so 
>>> can decide on and tune the format before loading data into CarbonData.
>>> 
>>> For example:
>>> 
>>> || Use Case || Supporting Features ||
>>> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
>>> || High throughput scan || Global dictionary, Minmax index ||
>>> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
>>> || Individual record query || Column group, Global dictionary ||
>>> 
>>> === BigData Processing Framework Integration ===
>>> 
>>> * CarbonData provides InputFormat/OutputFormat interfaces for reading and 
>>> writing data in CarbonData files, and also provides an abstract API for 
>>> processing data stored in the CarbonData format with data processing 
>>> frameworks.
>>> * CarbonData provides deep integration with Apache Spark, including 
>>> predicate push down, column pruning, aggregation push down, etc., so users 
>>> can use Spark SQL to connect to and query CarbonData.
>>> * CarbonData can integrate with various big data query/processing frameworks 
>>> in the Hadoop ecosystem, such as Apache Spark and Apache Hive.
>>> 
>>> Example: 
>>> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>>> 
>>> == Initial Goals ==
>>> 
>>> Our initial goals are to bring CarbonData into the ASF, transition internal 
>>> engineering processes into the open, and foster a collaborative development 
>>> model according to the "Apache Way".
>>> 
>>> == Current Status ==
>>> 
>>> CarbonData is production ready and already provides a large set of features.
>>> The current license is already Apache 2.0.
>>> 
>>> == Meritocracy ==
>>> 
>>> We intend to radically expand the initial developer and user community by 
>>> running the project in accordance with the "Apache Way". Users and new 
>>> contributors will be treated with respect and welcomed. By participating in 
>>> the community and providing quality patches/support that move the project 
>>> forward, they will earn merit. They also will be encouraged to provide 
>>> non-code contributions (documentation, events, community management, etc.) 
>>> and will gain merit for doing so. Those with a proven support and quality 
>>> track record will be encouraged to become committers.
>>> 
>>> == Community ==
>>> 
>>> If CarbonData is accepted for incubation, the primary initial goal is to 
>>> build a large community. We firmly believe that CarbonData will become a 
>>> key project for big data column-like platforms, and so we are betting on a 
>>> large community of users and developers.
>>> 
>>> == Known Risks ==
>>> 
>>> Development has been sponsored mostly by one company. For the project to 
>>> fully transition to the Apache Way governance model, development must shift 
>>> towards the meritocracy-centric model of growing a community of 
>>> contributors, balanced with the needs for extreme stability and core 
>>> implementation coherency.
>>> 
>>> == Orphaned products ==
>>> 
>>> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested 
>>> interest in making CarbonData succeed by driving its close integration with 
>>> sister ASF projects. We expect this to further reduce the risk of 
>>> orphaning the product.
>>> 
>>> == Inexperience with Open Source ==
>>> 
>>> Huawei has been developing and using open source software for a long 
>>> time. Additionally, several ASF veterans agreed to mentor the project and 
>>> are listed in this proposal. The project will rely on their guidance and 
>>> collective wisdom to quickly transition the entire team of initial 
>>> committers towards practicing the Apache Way.
>>> 
>>> == Reliance on Salaried Developers ==
>>> 
>>> Most of the contributors are paid to work in the big data space. While they 
>>> might wander from their current employers, they are unlikely to venture far 
>>> from their core expertise and thus will continue to be engaged with the 
>>> project regardless of their current employers.
>>> 
>>> == An Excessive Fascination with the Apache Brand ==
>>> 
>>> While we intend to leverage the Apache ‘branding’ when talking to other 
>>> projects as a testament to our project’s ‘neutrality’, we have no plans for 
>>> making use of the Apache brand in press releases or for posting billboards 
>>> advertising the acceptance of CarbonData into the Apache Incubator.
>>> 
>>> == Initial Source ==
>>> 
>>> https://github.com/HuaweiBigData/carbondata.git
>>> 
>>> == External Dependencies ==
>>> 
>>> All external dependencies are licensed under an Apache 2.0 license or
>>> Apache-compatible license. As we grow the CarbonData community we will
>>> configure our build process to require and validate that all contributions
>>> and dependencies are licensed under the Apache 2.0 license or are under
>>> an Apache-compatible license.
>>> 
>>> * Apache Spark
>>> * Apache Hadoop
>>> * Apache Maven
>>> * Apache Commons
>>> * Apache Log4j
>>> * Apache Thrift
>>> * Apache Zookeeper
>>> * Scala
>>> * Snappy
>>> * Kettle (Pentaho)
>>> * Eigenbase
>>> * Fastutil
>>> * GSON
>>> * Jmockit
>>> * Junit
>>> 
>>> == Required Resources ==
>>> 
>>> === Mailing lists ===
>>> 
>>> * priv...@carbondata.incubator.apache.org (moderated subscriptions)
>>> * comm...@carbondata.incubator.apache.org
>>> * d...@carbondata.incubator.apache.org
>>> * iss...@carbondata.incubator.apache.org
>>> 
>>> === Git Repository ===
>>> 
>>> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>>> 
>>> === Issue Tracking ===
>>> 
>>> * JIRA Project CarbonData (CarbonData)
>>> 
>>> === Initial Committers ===
>>> 
>>> * Liang Chenliang
>>> * Jean-Baptiste Onofré
>>> * Henry Saputra
>>> * Uma Maheswara Rao G
>>> * Jenny MA
>>> * Jacky Likun
>>> * Vimal Das Kammath
>>> * Jarray Qiuheng
>>> 
>>> === Affiliations ===
>>> 
>>> * Huawei: Liang Chenliang
>>> * Talend: Jean-Baptiste Onofré
>>> * Ebay: Henry Saputra
>>> * Intel: Uma Maheswara Rao G
>>> 
>>> === Sponsors ===
>>> 
>>> === Champion ===
>>> 
>>> * Jean-Baptiste Onofré - Apache Member
>>> 
>>> === Mentors ===
>>> 
>>> * Henry Saputra (eBay)
>>> * Jean-Baptiste Onofré (Talend)
>>> * Uma Maheswara Rao G (Intel)
>>> 
>>> === Sponsoring Entity ===
>>> 
>>> The Apache Incubator
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>> For additional commands, e-mail: general-h...@incubator.apache.org
>>> 
>> 
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>