+1. Compression is quite important for scan performance. I think all the points you listed are valid. Please feel free to contribute.
Regards,
Jacky

> On 12 Sep 2018, at 5:09 PM, Kumar Vishal <[email protected]> wrote:
>
> Hi All,
> I am working on the below carbondata store size optimizations to reduce
> the size of the carbondata file, which will improve IO performance during
> query.
>
> *1. String/Varchar store size optimization*
> *Problem:*
> Currently String/Varchar data type values are stored in LV (length-value)
> format in the carbondata file. During query, the offset (position of each
> cell value) in a page must first be calculated, which impacts query
> performance. The storage size is also high, because no encoding can be
> applied to the length part while it is stored inline with the data.
> *Solution:*
> Store the length part separately from the data part and apply adaptive
> encoding to the lengths. This will optimize the store size, and during
> query the offset calculation will be much faster since we only need to
> look at the length part. It will improve query performance.
>
> *2. Adaptive encoding for Global/Direct/Local dictionary columns*
> *Problem:*
> Global/Direct/Local dictionary values are stored in binary format and only
> snappy compression is applied. As Global/Direct/Local dictionary values
> are of Integer data type, they can be adaptively stored with a data type
> smaller than Integer.
> *Solution:*
> Add adaptive encoding for global/direct dictionary columns to reduce the
> store size.
>
> *3. Local dictionary for Primitive data type columns*
> Currently in carbondata, local dictionary is not supported for primitive
> columns (it is supported only for String data type columns). For low
> cardinality columns, local dictionary encoding will be effective, and
> adaptive encoding can be applied on top of it. This will reduce the store
> size.
>
> Any suggestion from the community is most welcome.
>
> -Regards
> Kumar Vishal
>
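To illustrate points 1 and 2 above, here is a minimal hypothetical sketch (not actual CarbonData code; class and method names are my own): it splits a string page into a separate length array and data part, then picks the smallest integer width that can hold every length, which is the essence of the adaptive idea.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed length/value split: instead of
// storing inline L+V pairs, keep lengths in their own array so an
// adaptive (smallest-fitting-width) encoding can be applied to them.
public class LvSplitSketch {

    // Collect the byte length of each value in the page.
    static int[] lengths(String[] values) {
        int[] lens = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            lens[i] = values[i].getBytes(StandardCharsets.UTF_8).length;
        }
        return lens;
    }

    // Adaptive choice: store each length as 1, 2, or 4 bytes
    // depending on the maximum length seen in the page.
    static int bytesPerLength(int[] lens) {
        int max = 0;
        for (int l : lens) {
            max = Math.max(max, l);
        }
        if (max <= Byte.MAX_VALUE) {
            return 1;
        }
        if (max <= Short.MAX_VALUE) {
            return 2;
        }
        return 4;
    }

    public static void main(String[] args) {
        String[] page = {"carbon", "data", "store"};
        int width = bytesPerLength(lengths(page));
        // All lengths fit in one byte, so the length part shrinks 4x
        // versus plain int offsets, and offsets are simple prefix sums.
        System.out.println(width); // prints 1
    }
}
```

The same width-selection step is what point 2 proposes for dictionary surrogate values, which are integers but rarely need all four bytes.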
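Point 3 (local dictionary for primitive columns) can likewise be sketched in a few lines, again as hypothetical illustration rather than CarbonData internals: each distinct value in a page gets a small surrogate code, and for a low-cardinality column those codes fit in far fewer bytes than the original values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of page-local dictionary encoding for a
// primitive (int) column: replace each value with a small surrogate
// code; adaptive encoding can then shrink the code width further.
public class LocalDictSketch {

    // Assign surrogate codes in order of first appearance and
    // rewrite the column as codes into the dictionary.
    static int[] encode(int[] column, Map<Integer, Integer> dict) {
        int[] codes = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            codes[i] = dict.computeIfAbsent(column[i], k -> dict.size());
        }
        return codes;
    }

    public static void main(String[] args) {
        int[] column = {1000, 2000, 1000, 3000, 2000};
        Map<Integer, Integer> dict = new LinkedHashMap<>();
        int[] codes = encode(column, dict);
        // Only 3 distinct values, so every code fits in a single byte.
        System.out.println(java.util.Arrays.toString(codes)); // [0, 1, 0, 2, 1]
    }
}
```

For a page with only a handful of distinct values, the dictionary plus one-byte codes is much smaller than the raw column, and the codes themselves are a good target for the adaptive encoding described in point 2.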
