I see code derived from Mondrian in the org.carbondata.core.carbon package[1] (I’m familiar with Mondrian’s code structure because I wrote it). Mondrian was originally EPL and as such cannot be re-licensed under ASL. Everything is probably fine, but as part of incubation we will need to make sure that this and other code have a clear provenance.
Julian

[1] https://github.com/HuaweiBigData/carbondata/tree/master/core/src/main/java/org/carbondata/core/carbon

> On May 19, 2016, at 10:04 AM, Liang Chen <chenliang...@huawei.com> wrote:
>
> Hi Lars,
>
> Thanks for participating in the discussion.
>
> Based on the requirements below, we investigated the existing file formats
> in the Hadoop ecosystem, but we could not find a solution that satisfies
> all of the requirements at the same time, so we started designing
> CarbonData.
>
> R1. Support big scans that fetch only a few columns.
> R2. Support primary-key lookups with sub-second response.
> R3. Support interactive OLAP-style queries over big data involving many
>     filters in a query; this type of workload should respond in seconds.
> R4. Support fast individual record extraction that fetches all columns of
>     the record.
> R5. Support HDFS so that customers can leverage existing Hadoop clusters.
>
> When we investigated Parquet/ORC, they seem to work very well for R1 and
> R5, but they do not meet R2, R3, and R4. So we designed CarbonData mainly
> to add the following differentiating features:
>
> 1. Stores data along with an index: this can significantly accelerate
>    query performance and reduce I/O and CPU usage when there are filters
>    in the query. The CarbonData index consists of multiple levels; a
>    processing framework can leverage this index to reduce the number of
>    tasks it needs to schedule and process, and it can also skip-scan at a
>    finer-grained unit (called a blocklet) on the task side instead of
>    scanning the whole file.
>
> 2. Operable encoded data: by supporting efficient compression and global
>    encoding schemes, queries can run directly on compressed/encoded data;
>    the data is converted only just before the results are returned to the
>    user, which is "late materialization".
>
> 3. Column group: allows multiple columns to form a column group that is
>    stored in row format, thus reducing the cost of column reconstruction.
>
> 4. Supports various use cases with one single data format: interactive
>    OLAP-style queries, sequential access (big scans), and random access
>    (narrow scans).
>
> Please kindly let me know if the above answers your questions.
>
> Regards,
> Liang
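The blocklet skip scan described in feature 1 above can be sketched roughly as follows: each blocklet carries min/max statistics for a column, and a filter is tested against those statistics before any data is read. This is a minimal illustration of the general technique, not CarbonData's actual index classes; all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of blocklet-level skip scanning (illustrative names, not CarbonData's):
// a blocklet keeps min/max statistics, and a point lookup skips any blocklet
// whose [min, max] range cannot contain the key.
public class BlockletSkipScan {

    static class Blocklet {
        final long min, max;   // column statistics kept in the index
        final long[] values;   // actual column data, read only if not skipped

        Blocklet(long[] values) {
            this.values = values;
            long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
            for (long v : values) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
            this.min = lo; this.max = hi;
        }
    }

    /** Returns values equal to 'key', scanning only blocklets whose
     *  statistics say the key could be present. */
    static List<Long> pointLookup(List<Blocklet> blocklets, long key) {
        List<Long> hits = new ArrayList<>();
        for (Blocklet b : blocklets) {
            if (key < b.min || key > b.max) continue; // skip: index rules it out
            for (long v : b.values) {
                if (v == key) hits.add(v);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Blocklet> file = new ArrayList<>();
        file.add(new Blocklet(new long[]{1, 5, 9}));
        file.add(new Blocklet(new long[]{20, 25, 30})); // skipped for key = 9
        System.out.println(pointLookup(file, 9));       // [9]
    }
}
```

The same check applied at coarser levels (file, block) is what lets a scheduler avoid launching tasks for data that a filter makes irrelevant.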
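The "operable encoded data" / late-materialization idea in feature 2 can be sketched with a global dictionary: the filter compares cheap integer codes, and only surviving rows are decoded back to their original values just before the result is returned. The dictionary contents and method names below are invented for illustration and do not reflect CarbonData's actual encoding layout.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of late materialization over dictionary-encoded data (hypothetical
// names): the predicate is encoded once, the scan compares integer codes,
// and decoding happens only for rows that reach the result set.
public class LateMaterialization {

    static final String[] DICT = {"CN", "IN", "US"}; // global dictionary: code -> value

    static int encode(String value) {
        for (int i = 0; i < DICT.length; i++) {
            if (DICT[i].equals(value)) return i;
        }
        throw new IllegalArgumentException("unknown value: " + value);
    }

    /** Filters on encoded codes; decodes only matching rows at the end. */
    static List<String> selectEquals(int[] encodedColumn, String predicate) {
        int target = encode(predicate);       // encode the predicate once
        List<String> result = new ArrayList<>();
        for (int code : encodedColumn) {
            if (code == target) {             // compare cheap integer codes
                result.add(DICT[code]);       // decode only at result time
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] country = {encode("CN"), encode("US"), encode("CN")};
        System.out.println(selectEquals(country, "CN")); // [CN, CN]
    }
}
```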