Hi All, I am working on the following CarbonData store size optimizations to reduce the size of the carbondata file, which will improve IO performance during queries.
*1. String/Varchar store size optimization*

*Problem:* Currently, String/Varchar values are stored in LV (length-value) format in the carbondata file. During a query, the offset (position of each cell value) within a page must first be computed by walking the data, which hurts query performance. The storage size is also high, because no encoding can be applied to the length part while it is interleaved with the data.

*Solution:* Store the length part separately from the data part and apply adaptive encoding to the lengths. This reduces store size, and offset calculation during query becomes much faster since only the length part needs to be scanned. It will improve query performance.

*2. Adaptive encoding for Global/Direct/Local dictionary columns*

*Problem:* Global/Direct/Local dictionary columns are stored in binary format, and only Snappy compression is applied. Since the dictionary values are of Integer data type, they can be adaptively stored using a data type smaller than Integer.

*Solution:* Apply adaptive encoding to global/direct dictionary columns to reduce the store size.

*3. Local dictionary for Primitive data type columns*

Currently, CarbonData supports local dictionary only for String data type columns, not for primitive columns. For low-cardinality columns, local dictionary encoding will be effective, and adaptive encoding can be applied on top of it. This will reduce the store size.

Any suggestion from the community is most welcome.

-Regards
Kumar Vishal
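P.S. To make item 1 concrete, here is a minimal sketch of the two layouts. The function names and byte formats are illustrative only, not CarbonData's actual page format: the point is that the interleaved LV layout forces a walk over the whole page to find offsets, while a separated length part allows a simple prefix sum and an adaptive (smallest-width) encoding of the lengths.

```python
import struct

def encode_lv(values):
    """Current layout (simplified): each value stored as [length][value]."""
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        out += struct.pack(">i", len(data)) + data
    return bytes(out)

def offsets_from_lv(page):
    """Query side must walk the whole page to locate each cell value."""
    offsets, pos = [], 0
    while pos < len(page):
        (length,) = struct.unpack_from(">i", page, pos)
        offsets.append(pos + 4)          # value starts after its 4-byte length
        pos += 4 + length
    return offsets

def encode_separated(values):
    """Proposed layout (simplified): length part stored apart from data part,
    with an adaptive width (1, 2 or 4 bytes) chosen from the max length."""
    datas = [v.encode("utf-8") for v in values]
    lengths = [len(d) for d in datas]
    max_len = max(lengths, default=0)
    fmt = "B" if max_len < 2**8 else ("H" if max_len < 2**16 else "I")
    length_part = struct.pack(">%d%s" % (len(lengths), fmt), *lengths)
    return fmt, length_part, b"".join(datas)

def offsets_from_lengths(lengths):
    """Offsets now come from a prefix sum over the decoded length part."""
    offsets, pos = [], 0
    for length in lengths:
        offsets.append(pos)
        pos += length
    return offsets
```

For three short values ("a", "bb", "ccc") the LV page takes 18 bytes, while the separated layout takes 3 bytes of lengths plus 6 bytes of data, before any further compression is even applied.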

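And a similar sketch for items 2 and 3 (again, hypothetical helper names, not CarbonData classes): the surrogate keys produced by a dictionary are small non-negative integers, so instead of fixed 4-byte ints they can be adaptively stored in the smallest width that holds the page's maximum key; the same idea extends a per-page (local) dictionary to primitive columns with low cardinality.

```python
import array

def adaptive_store(keys):
    """Pick the smallest integer width that can hold all surrogate keys."""
    max_key = max(keys, default=0)
    if max_key < 2**8:
        typecode = "B"   # 1 byte per key instead of 4
    elif max_key < 2**16:
        typecode = "H"   # 2 bytes per key
    else:
        typecode = "I"   # fall back to a full 4-byte int
    return array.array(typecode, keys)

def local_dictionary_encode(page_values):
    """Build a per-page (local) dictionary for a primitive column and
    store adaptive-width surrogate keys instead of the raw values."""
    dictionary, keys, index = [], [], {}
    for v in page_values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        keys.append(index[v])
    return dictionary, adaptive_store(keys)
```

For a low-cardinality page such as [100, 200, 100, 100, 200], the dictionary holds two entries and each row costs one byte, and the decoded values round-trip exactly.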