EmmyMiao87 commented on a change in pull request #4198: URL: https://github.com/apache/incubator-doris/pull/4198#discussion_r462711408
########## File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md ##########
@@ -0,0 +1,209 @@
+---
+{
+    "title": "BITMAP正交计算UDAF",
+    "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# BITMAP正交计算UDAF

Review comment: same as above

########## File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md ##########
@@ -0,0 +1,209 @@
[...]
+
+## 背景
+
+Doris原有的Bitmap聚合函数设计比较通用,但对亿级别以上bitmap大基数的交集和并集计算性能较差。排查后端be的bitmap聚合函数逻辑,发现主要有两个原因。一是当bitmap基数较大时,如数据大小超过1g,网络/磁盘IO处理时间比较长;二是后端be实例在scan数据后全部传输到顶层节点进行求交和并运算,给顶层单节点带来压力,成为处理瓶颈。
+
+解决方案是建表时增加hid列,罐库时hid列按照bitmap列的range划分,并且按hid均匀分桶。这样按range划分的聚合bitmap数据会均匀地分布在所有后端be实例上。在schema表的基础上,优化udaf聚合函数,使其在所有扫描节点参与分布式正交并算,然后在顶层节点进行汇总,如此会大大提高计算效率。
+
+## Create table
+
+建表时需要使用聚合模型,数据类型是 bitmap , 聚合函数是 bitmap_union
+
+```
+CREATE TABLE `user_tag_bitmap` (
+  `tag` bigint(20) NULL COMMENT "用户标签",
+  `hid` smallint(6) NULL COMMENT "分桶id",
+  `user_id` bitmap BITMAP_UNION NULL COMMENT ""
+) ENGINE=OLAP
+AGGREGATE KEY(`tag`, `hid`)
+COMMENT "OLAP"
+DISTRIBUTED BY HASH(`hid`) BUCKETS 3
+```
+表schema增加hid列,表示id范围, 作为hash分桶列。
+
+注:hid数和BUCKETS要设置合理,hid数设置至少是BUCKETS的5倍以上,以使数据hash分桶尽量均衡
+
+## Data Load
+
+```
+LOAD LABEL user_tag_bitmap_test
+(
+DATA INFILE('hdfs://abc')
+INTO TABLE user_tag_bitmap
+COLUMNS TERMINATED BY ','
+(tmp_tag, tmp_user_id)
+SET (
+tag = tmp_tag,
+hid = ceil(tmp_user_id/5000000),
+user_id = to_bitmap(tmp_user_id)
+)
+)
+...
+```
+数据格式:
+```
+11111111,1
+11111112,2
+11111113,3
+11111114,4
+...
+```
+注:第一列代表用户标签,如'男', '90后', '10-20万'等,已由中文转换成数字
+
+load数据时,对用户bitmap进行纵向切割,例如,用户id在1-5000000范围内的hid相同,hid相同的会被均匀的hash分配后端be实例进行union聚合。在bitmap的udaf实现上,可以利用tablet在be上平均分散的特性,在local节点scan数据后,直接进行交集、并集计算,在top节点merge阶段进行汇总计算结果,此设计能充分发挥所有be并发计算的特性。
+
+## 自定义UDAF
+Doris查询前设置参数
+```
+set parallel_fragment_exec_instance_num=5
+```
+注:根据集群情况设置并发参数,提高并发计算性能
+
+新udaf需要在doris定义聚合函数时注册函数符号,函数符号通过动态库.so的方式被加载。
+
+### bitmap_orthogonal_intersect
+

Review comment: Start with an introduction to the function: what does it do, and what is it used for?

########## File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md ##########
@@ -0,0 +1,209 @@
[...]
+求交集函数
+  bitmap_orthogonal_intersect(bitmap_column, column_to_filter, filter_values)
+
+参数:

Review comment: The description of each parameter should explain what that parameter means. For example: the first parameter is of type bitmap and is the column whose intersection is computed.

########## File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md ##########
@@ -0,0 +1,209 @@
[...]
+  第一个参数是Bitmap列,第二个参数是用来过滤的维度列,第三个参数开始是变长参数,含义是过滤维度列的不同取值
+
+说明:
+  此udaf,在此表schema的基础上,查询规划上聚合分2层,在第一层be节点(update、serialize)先按filter_values为key进行hash聚合,然后对所有key的bitmap求交集,结果序列化后发送至第二层be节点(merge、finalize),在第二层be节点对所有来源于第一层节点的bitmap值循环求并集
+
+
+定义:
+```
+drop FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...);
+CREATE AGGREGATE FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...) RETURNS BITMAP INTERMEDIATE varchar(1)

Review comment: Please keep the names consistent throughout the document.

########## File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md ##########
@@ -0,0 +1,209 @@
+---
+{
+    "title": "BITMAP正交计算UDAF",

Review comment: 正交的BITMAP计算UDAF (that is, "Orthogonal BITMAP calculation UDAF")

########## File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md ##########
@@ -0,0 +1,209 @@
[...]
+解决方案是建表时增加hid列,罐库时hid列按照bitmap列的range划分,并且按hid均匀分桶。这样按range划分的聚合bitmap数据会均匀地分布在所有后端be实例上。在schema表的基础上,优化udaf聚合函数,使其在所有扫描节点参与分布式正交并算,然后在顶层节点进行汇总,如此会大大提高计算效率。

Review comment: Consider starting with an overview of the approach. For example: the values of the bitmap column are first partitioned by range, and values in different ranges are stored in different buckets, which guarantees that the bitmap values in different buckets are orthogonal. Then go into the details, and finish by explaining why this speeds up queries.

########## File path: contrib/udf/src/udaf_orthogonal_bitmap/orthogonal_bitmap_function.cpp ##########
@@ -0,0 +1,492 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "orthogonal_bitmap_function.h"
+#include "bitmap_value.h"
+#include "string_value.h"
+#include <iostream>
+
+namespace doris_udf {
+
+namespace detail {
+
+const int DATETIME_PACKED_TIME_BYTE_SIZE = 8;
+const int DATETIME_TYPE_BYTE_SIZE = 4;
+
+const int DECIMAL_BYTE_SIZE = 16;
+
+// get_val start
+template<typename ValType, typename T>
+T get_val(const ValType& x) {
+    return x.val;
+}
+
+template<>
+StringValue get_val(const StringVal& x) {
+    return StringValue::from_string_val(x);
+}
+// get_val end
+
+// serialize_size start
+template<typename T>
+int32_t serialize_size(const T& v) {
+    return sizeof(T);
+}
+
+template<>
+int32_t serialize_size(const StringValue& v) {
+    return v.len + 4;
+}
+// serialize_size end
+
+// write_to start
+template<typename T>
+char* write_to(const T& v, char* dest) {
+    size_t type_size = sizeof(T);
+    memcpy(dest, &v, type_size);
+    dest += type_size;
+    return dest;
+}
+
+template<>
+char* write_to(const StringValue& v, char* dest) {
+    *(int32_t*)dest = v.len;
+    dest += 4;
+    memcpy(dest, v.ptr, v.len);
+    dest += v.len;
+    return dest;
+}
+// write_to end
+
+// read_from start
+template<typename T>
+void read_from(const char** src, T* result) {
+    size_t type_size = sizeof(T);
+    memcpy(result, *src, type_size);
+    *src += type_size;
+}
+
+template<>
+void read_from(const char** src, StringValue* result) {
+    int32_t length = *(int32_t*)(*src);
+    *src += 4;
+    *result = StringValue((char *)*src, length);
+    *src += length;
+}
+// read_from end
+
+} // namespace detail
+
+static StringVal serialize(FunctionContext* ctx, BitmapValue* value) {
+    StringVal result(ctx, value->getSizeInBytes());
+    value->write((char*) result.ptr);
+    return result;
+}
+
+// Calculate the intersection of two or more bitmaps
+template<typename T>
+struct BitmapIntersect {
+public:
+    BitmapIntersect() {}
+
+    explicit BitmapIntersect(const char* src) {
+        deserialize(src);
+    }
+
+    void add_key(const T key) {
+        BitmapValue empty_bitmap;
+        _bitmaps[key] = empty_bitmap;
+    }
+
+    void update(const T& key, const BitmapValue& bitmap) {
+        if (_bitmaps.find(key) != _bitmaps.end()) {
+            _bitmaps[key] |= bitmap;
+        }
+    }
+
+    void merge(const BitmapIntersect& other) {
+        for (auto& kv: other._bitmaps) {
+            if (_bitmaps.find(kv.first) != _bitmaps.end()) {
+                _bitmaps[kv.first] |= kv.second;
+            } else {
+                _bitmaps[kv.first] = kv.second;
+            }
+        }
+    }
+
+    // calculate the intersection for _bitmaps's bitmap values
+    int64_t intersect_count() const {
+        if (_bitmaps.empty()) {
+            return 0;
+        }
+
+        BitmapValue result;
+        auto it = _bitmaps.begin();
+        result |= it->second;
+        it++;
+        for (;it != _bitmaps.end(); it++) {
+            result &= it->second;
+        }
+
+        return result.cardinality();
+    }
+
+    // intersection
+    BitmapValue intersect() {
+        BitmapValue result;
+        auto it = _bitmaps.begin();
+        result |= it->second;
+        it++;
+        for (;it != _bitmaps.end(); it++) {
+            result &= it->second;
+        }
+        return result;
+    }
+
+    // the serialize size
+    size_t size() {
+        size_t size = 4;
+        for (auto& kv: _bitmaps) {
+            size += detail::serialize_size(kv.first);;
+            size += kv.second.getSizeInBytes();
+        }
+        return size;
+    }
+
+    //must call size() first
+    void serialize(char* dest) {
+        char* writer = dest;
+        *(int32_t*)writer = _bitmaps.size();
+        writer += 4;
+        for (auto& kv: _bitmaps) {
+            writer = detail::write_to(kv.first, writer);
+            kv.second.write(writer);
+            writer += kv.second.getSizeInBytes();
+        }
+    }
+
+    void deserialize(const char* src) {
+        const char* reader = src;
+        int32_t bitmaps_size = *(int32_t*)reader;
+        reader += 4;
+        for (int32_t i = 0; i < bitmaps_size; i++) {
+            T key;
+            detail::read_from(&reader, &key);
+            BitmapValue bitmap(reader);
+            reader += bitmap.getSizeInBytes();
+            _bitmaps[key] = bitmap;
+        }
+    }
+
+private:
+    std::map<T, BitmapValue> _bitmaps;
+};
+
+void OrthogonalBitmapFunctions::init() {
+}
+
+void OrthogonalBitmapFunctions::bitmap_union_count_init(FunctionContext* ctx, StringVal* dst) {
+    dst->is_null = false;
+    dst->len = sizeof(BitmapValue);
+    dst->ptr = (uint8_t*)new BitmapValue();
+}
+
+void OrthogonalBitmapFunctions::bitmap_union(FunctionContext* ctx, const StringVal& src, StringVal* dst) {
+    if (src.is_null) {
+        return;
+    }
+    auto dst_bitmap = reinterpret_cast<BitmapValue*>(dst->ptr);
+    // zero size means the src input is a agg object
+    if (src.len == 0) {
+        (*dst_bitmap) |= *reinterpret_cast<BitmapValue*>(src.ptr);
+    } else {
+        (*dst_bitmap) |= BitmapValue((char*) src.ptr);
+    }
+}
+
+StringVal OrthogonalBitmapFunctions::bitmap_serialize(FunctionContext* ctx, const StringVal& src) {
+    if (src.is_null) {
+        return src;
+    }
+
+    auto src_bitmap = reinterpret_cast<BitmapValue*>(src.ptr);
+    StringVal result = serialize(ctx, src_bitmap);
+    delete src_bitmap;
+    return result;
+}
+
+StringVal OrthogonalBitmapFunctions::bitmap_count_serialize(FunctionContext* ctx, const StringVal& src) {
+    if (src.is_null) {
+        return src;
+    }
+
+    auto src_bitmap = reinterpret_cast<BitmapValue*>(src.ptr);
+    int64_t val = src_bitmap->cardinality();
+    StringVal result(ctx, sizeof(int64_t));
+
+    *(int64_t*)result.ptr = val;
+    delete src_bitmap;
+    return result;
+
+}
+
+// This is a init function for bitmap_intersect.
+template<typename T, typename ValType>
+void OrthogonalBitmapFunctions::bitmap_intersect_init(FunctionContext* ctx, StringVal* dst) {
+    // constant args start from index 2
+    if (ctx->get_num_constant_args() > 1) {
+        dst->is_null = false;
+        dst->len = sizeof(BitmapIntersect<T>);
+        auto intersect = new BitmapIntersect<T>();
+
+        for (int i = 2; i < ctx->get_num_constant_args(); ++i) {
+            ValType* arg = reinterpret_cast<ValType*>(ctx->get_constant_arg(i));
+            intersect->add_key(detail::get_val<ValType, T>(*arg));
+        }
+
+        dst->ptr = (uint8_t*)intersect;
+    } else {
+        dst->is_null = false;
+        dst->len = sizeof(BitmapValue);
+        dst->ptr = (uint8_t*)new BitmapValue();
+    }
+}
+
+// This is a init function for intersect_count.
+template<typename T, typename ValType> +void OrthogonalBitmapFunctions::bitmap_intersect_count_init(FunctionContext* ctx, StringVal* dst) { + if (ctx->get_num_constant_args() > 1) { + dst->is_null = false; + dst->len = sizeof(BitmapIntersect<T>); + auto intersect = new BitmapIntersect<T>(); + + // constant args start from index 2 + for (int i = 2; i < ctx->get_num_constant_args(); ++i) { + ValType* arg = reinterpret_cast<ValType*>(ctx->get_constant_arg(i)); + intersect->add_key(detail::get_val<ValType, T>(*arg)); + } + + dst->ptr = (uint8_t*)intersect; + } else { + dst->is_null = false; + dst->len = sizeof(int64_t); + dst->ptr = (uint8_t*)new int64_t; + *(int64_t *)dst->ptr = 0; + } +} + +template<typename T, typename ValType> +void OrthogonalBitmapFunctions::bitmap_intersect_update(FunctionContext* ctx, const StringVal& src, const ValType& key, + int num_key, const ValType* keys, const StringVal* dst) { + auto* dst_bitmap = reinterpret_cast<BitmapIntersect<T>*>(dst->ptr); + // zero size means the src input is a agg object + if (src.len == 0) { + dst_bitmap->update(detail::get_val<ValType, T>(key), *reinterpret_cast<BitmapValue*>(src.ptr)); + } else { + dst_bitmap->update(detail::get_val<ValType, T>(key), BitmapValue((char*)src.ptr)); + } +} + +template<typename T> +void OrthogonalBitmapFunctions::bitmap_intersect_merge(FunctionContext* ctx, const StringVal& src, const StringVal* dst) { + auto* dst_bitmap = reinterpret_cast<BitmapIntersect<T>*>(dst->ptr); + dst_bitmap->merge(BitmapIntersect<T>((char*)src.ptr)); +} + +template<typename T> +StringVal OrthogonalBitmapFunctions::bitmap_intersect_serialize(FunctionContext* ctx, const StringVal& src) { + auto* src_bitmap = reinterpret_cast<BitmapIntersect<T>*>(src.ptr); + StringVal result(ctx, src_bitmap->size()); + src_bitmap->serialize((char*)result.ptr); + delete src_bitmap; + return result; +} + +template<typename T> +BigIntVal OrthogonalBitmapFunctions::bitmap_intersect_finalize(FunctionContext* ctx, const 
StringVal& src) { + auto* src_bitmap = reinterpret_cast<BitmapIntersect<T>*>(src.ptr); + BigIntVal result = BigIntVal(src_bitmap->intersect_count()); + delete src_bitmap; + return result; +} + +void OrthogonalBitmapFunctions::bitmap_count_merge(FunctionContext* context, const StringVal& src, StringVal* dst) { + if (dst->len != sizeof(int64_t)) { Review comment: Will dst be bitmap value? ########## File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md ########## @@ -0,0 +1,209 @@ +--- +{ + "title": "BITMAP正交计算UDAF", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
+--> + +# BITMAP正交计算UDAF + +## 背景 + +Doris原有的Bitmap聚合函数设计比较通用,但对亿级别以上bitmap大基数的交集和并集计算性能较差。排查后端be的bitmap聚合函数逻辑,发现主要有两个原因。一是当bitmap基数较大时,如数据大小超过1g,网络/磁盘IO处理时间比较长;二是后端be实例在scan数据后全部传输到顶层节点进行求交和并运算,给顶层单节点带来压力,成为处理瓶颈。 + +解决方案是建表时增加hid列,罐库时hid列按照bitmap列的range划分,并且按hid均匀分桶。这样按range划分的聚合bitmap数据会均匀地分布在所有后端be实例上。在schema表的基础上,优化udaf聚合函数,使其在所有扫描节点参与分布式正交并算,然后在顶层节点进行汇总,如此会大大提高计算效率。 + +## Create table + +建表时需要使用聚合模型,数据类型是 bitmap , 聚合函数是 bitmap_union + +``` +CREATE TABLE `user_tag_bitmap` ( + `tag` bigint(20) NULL COMMENT "用户标签", + `hid` smallint(6) NULL COMMENT "分桶id", + `user_id` bitmap BITMAP_UNION NULL COMMENT "" +) ENGINE=OLAP +AGGREGATE KEY(`tag`, `hid`) +COMMENT "OLAP" +DISTRIBUTED BY HASH(`hid`) BUCKETS 3 +``` +表schema增加hid列,表示id范围, 作为hash分桶列。 + +注:hid数和BUCKETS要设置合理,hid数设置至少是BUCKETS的5倍以上,以使数据hash分桶尽量均衡 + +## Data Load + +``` +LOAD LABEL user_tag_bitmap_test +( +DATA INFILE('hdfs://abc') +INTO TABLE user_tag_bitmap +COLUMNS TERMINATED BY ',' +(tmp_tag, tmp_user_id) +SET ( +tag = tmp_tag, +hid = ceil(tmp_user_id/5000000), +user_id = to_bitmap(tmp_user_id) +) +) +... +``` +数据格式: +``` +11111111,1 +11111112,2 +11111113,3 +11111114,4 +... 
+``` +注:第一列代表用户标签,如'男', '90后', '10-20万'等,已由中文转换成数字 + +load数据时,对用户bitmap进行纵向切割,例如,用户id在1-5000000范围内的hid相同,hid相同的会被均匀的hash分配后端be实例进行union聚合。在bitmap的udaf实现上,可以利用tablet在be上平均分散的特性,在local节点scan数据后,直接进行交集、并集计算,在top节点merge阶段进行汇总计算结果,此设计能充分发挥所有be并发计算的特性。 + +## 自定义UDAF +Doris查询前设置参数 +``` +set parallel_fragment_exec_instance_num=5 +``` +注:根据集群情况设置并发参数,提高并发计算性能 + +新udaf需要在doris定义聚合函数时注册函数符号,函数符号通过动态库.so的方式被加载。 + +### bitmap_orthogonal_intersect + +求交集函数 + bitmap_orthogonal_intersect(bitmap_column, column_to_filter, filter_values) + +参数: + 第一个参数是Bitmap列,第二个参数是用来过滤的维度列,第三个参数开始是变长参数,含义是过滤维度列的不同取值 + +说明: + 此udaf,在此表schema的基础上,查询规划上聚合分2层,在第一层be节点(update、serialize)先按filter_values为key进行hash聚合,然后对所有key的bitmap求交集,结果序列化后发送至第二层be节点(merge、finalize),在第二层be节点对所有来源于第一层节点的bitmap值循环求并集 + + +定义: +``` +drop FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...); +CREATE AGGREGATE FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...) RETURNS BITMAP INTERMEDIATE varchar(1) +PROPERTIES ( +"init_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions21bitmap_intersect_initIlNS_9BigIntValEEEvPNS_15FunctionContextEPNS_9StringValE", +"update_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions23bitmap_intersect_updateIlNS_9BigIntValEEEvPNS_15FunctionContextERKNS_9StringValERKT0_iPS9_PS6_", +"serialize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions30bitmap_intersect_and_serializeIlEENS_9StringValEPNS_15FunctionContextERKS2_", +"merge_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions12bitmap_unionEPNS_15FunctionContextERKNS_9StringValEPS3_", +"finalize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions16bitmap_serializeEPNS_15FunctionContextERKNS_9StringValE", +"object_file"="http://ip:port/libudaf_orthogonal_bitmap.so" ); + +``` +注意: +1.column_to_filter, filter_values列这里设置为BIGINT类型; +2.函数符号通过nm /xxx/xxx/libudaf_bitmap.so|grep "bitmap_intersect" 查找 + +样例: +``` +select BITMAP_COUNT(bitmap_orthogonal_intersect(user_id, tag, 13080800, 11110200)) from user_tag_bitmap where tag in (13080800, 11110200); + +``` + +### 
bitmap_orthogonal_intersect_count +求交集count函数: + bitmap_orthogonal_intersect_count(bitmap_column, column_to_filter, filter_values) + +参数: + 第一个参数是Bitmap列,第二个参数是用来过滤的维度列,第三个参数开始是变长参数,含义是过滤维度列的不同取值 + +说明: + 此udaf定义同原版intersect_count,但实现不同。 + 此udaf,在此表schema的基础上,查询规划聚合上分2层,在第一层be节点(update、serialize)先按filter_values为key进行hash聚合,然后对所有key的bitmap求交集,再对交集结果求count,count值序列化后发送至第二层be节点(merge、finalize),在第二层be节点对所有来源于第一层节点的count值循环求sum + +定义: +``` +drop FUNCTION bitmap_orthogonal_intersect_count(BITMAP,BIGINT,BIGINT, ...); +CREATE AGGREGATE FUNCTION bitmap_orthogonal_intersect_count(BITMAP,BIGINT,BIGINT, ...) RETURNS BIGINT INTERMEDIATE varchar(1) +PROPERTIES ( +"init_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions27bitmap_intersect_count_initIlNS_9BigIntValEEEvPNS_15FunctionContextEPNS_9StringValE", +"update_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions23bitmap_intersect_updateIlNS_9BigIntValEEEvPNS_15FunctionContextERKNS_9StringValERKT0_iPS9_PS6_", +"serialize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions32bitmap_intersect_count_serializeIlEENS_9StringValEPNS_15FunctionContextERKS2_", +"merge_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions18bitmap_count_mergeEPNS_15FunctionContextERKNS_9StringValEPS3_", +"finalize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions21bitmap_count_finalizeEPNS_15FunctionContextERKNS_9StringValE", +"object_file"="http://ip:port/libudaf_orthogonal_bitmap.so" ); +``` + +### bitmap_orthogonal_union_count +求并集count函数: + bitmap_orthogonal_union_count(bitmap_column) + +说明: + 此udaf定义同原版bitmap_union_count,但实现不同。 + 此udaf,在此表schema的基础上,查询规划上分2层,在第一层be节点(update、serialize)对所有bitmap求并集,再对并集的结果bitmap求count,count值序列化后发送至第二层be节点(merge、finalize),在第二层be节点对所有来源于第一层节点的count值循环求sum + +定义: +``` +drop FUNCTION bitmap_orthogonal_union_count(BITMAP); +CREATE AGGREGATE FUNCTION bitmap_orthogonal_union_count(BITMAP) RETURNS BIGINT INTERMEDIATE varchar(1) +PROPERTIES ( 
+"init_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions23bitmap_union_count_initEPNS_15FunctionContextEPNS_9StringValE", +"update_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions12bitmap_unionEPNS_15FunctionContextERKNS_9StringValEPS3_", +"serialize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions22bitmap_count_serializeEPNS_15FunctionContextERKNS_9StringValE", +"merge_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions18bitmap_count_mergeEPNS_15FunctionContextERKNS_9StringValEPS3_", +"finalize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions21bitmap_count_finalizeEPNS_15FunctionContextERKNS_9StringValE", +"object_file"="http://ip:port/libudaf_orthogonal_bitmap.so" ); +``` + +## 源码及编译 +源代码: +``` +contrib/udf/src/udaf_bitmap/ Review comment: 名称统一一下,比如 都用udaf_orthogonal_bitmap ########## File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md ########## @@ -0,0 +1,209 @@ +--- +{ + "title": "BITMAP正交计算UDAF", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
+--> + +# BITMAP正交计算UDAF + +## 背景 + +Doris原有的Bitmap聚合函数设计比较通用,但对亿级别以上bitmap大基数的交集和并集计算性能较差。排查后端be的bitmap聚合函数逻辑,发现主要有两个原因。一是当bitmap基数较大时,如数据大小超过1g,网络/磁盘IO处理时间比较长;二是后端be实例在scan数据后全部传输到顶层节点进行求交和并运算,给顶层单节点带来压力,成为处理瓶颈。 + +解决方案是建表时增加hid列,罐库时hid列按照bitmap列的range划分,并且按hid均匀分桶。这样按range划分的聚合bitmap数据会均匀地分布在所有后端be实例上。在schema表的基础上,优化udaf聚合函数,使其在所有扫描节点参与分布式正交并算,然后在顶层节点进行汇总,如此会大大提高计算效率。 + +## Create table + +建表时需要使用聚合模型,数据类型是 bitmap , 聚合函数是 bitmap_union + +``` +CREATE TABLE `user_tag_bitmap` ( + `tag` bigint(20) NULL COMMENT "用户标签", + `hid` smallint(6) NULL COMMENT "分桶id", + `user_id` bitmap BITMAP_UNION NULL COMMENT "" +) ENGINE=OLAP +AGGREGATE KEY(`tag`, `hid`) +COMMENT "OLAP" +DISTRIBUTED BY HASH(`hid`) BUCKETS 3 +``` +表schema增加hid列,表示id范围, 作为hash分桶列。 + +注:hid数和BUCKETS要设置合理,hid数设置至少是BUCKETS的5倍以上,以使数据hash分桶尽量均衡 + +## Data Load + +``` +LOAD LABEL user_tag_bitmap_test +( +DATA INFILE('hdfs://abc') +INTO TABLE user_tag_bitmap +COLUMNS TERMINATED BY ',' +(tmp_tag, tmp_user_id) +SET ( +tag = tmp_tag, +hid = ceil(tmp_user_id/5000000), +user_id = to_bitmap(tmp_user_id) +) +) +... +``` +数据格式: +``` +11111111,1 +11111112,2 +11111113,3 +11111114,4 +... 
+``` +注:第一列代表用户标签,如'男', '90后', '10-20万'等,已由中文转换成数字 + +load数据时,对用户bitmap进行纵向切割,例如,用户id在1-5000000范围内的hid相同,hid相同的会被均匀的hash分配后端be实例进行union聚合。在bitmap的udaf实现上,可以利用tablet在be上平均分散的特性,在local节点scan数据后,直接进行交集、并集计算,在top节点merge阶段进行汇总计算结果,此设计能充分发挥所有be并发计算的特性。 + +## 自定义UDAF +Doris查询前设置参数 +``` +set parallel_fragment_exec_instance_num=5 +``` +注:根据集群情况设置并发参数,提高并发计算性能 + +新udaf需要在doris定义聚合函数时注册函数符号,函数符号通过动态库.so的方式被加载。 + +### bitmap_orthogonal_intersect + +求交集函数 + bitmap_orthogonal_intersect(bitmap_column, column_to_filter, filter_values) + +参数: + 第一个参数是Bitmap列,第二个参数是用来过滤的维度列,第三个参数开始是变长参数,含义是过滤维度列的不同取值 + +说明: + 此udaf,在此表schema的基础上,查询规划上聚合分2层,在第一层be节点(update、serialize)先按filter_values为key进行hash聚合,然后对所有key的bitmap求交集,结果序列化后发送至第二层be节点(merge、finalize),在第二层be节点对所有来源于第一层节点的bitmap值循环求并集 + + +定义: +``` +drop FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...); +CREATE AGGREGATE FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...) RETURNS BITMAP INTERMEDIATE varchar(1) +PROPERTIES ( +"init_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions21bitmap_intersect_initIlNS_9BigIntValEEEvPNS_15FunctionContextEPNS_9StringValE", +"update_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions23bitmap_intersect_updateIlNS_9BigIntValEEEvPNS_15FunctionContextERKNS_9StringValERKT0_iPS9_PS6_", +"serialize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions30bitmap_intersect_and_serializeIlEENS_9StringValEPNS_15FunctionContextERKS2_", +"merge_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions12bitmap_unionEPNS_15FunctionContextERKNS_9StringValEPS3_", +"finalize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions16bitmap_serializeEPNS_15FunctionContextERKNS_9StringValE", +"object_file"="http://ip:port/libudaf_orthogonal_bitmap.so" ); + +``` +注意: +1.column_to_filter, filter_values列这里设置为BIGINT类型; +2.函数符号通过nm /xxx/xxx/libudaf_bitmap.so|grep "bitmap_intersect" 查找 + +样例: +``` +select BITMAP_COUNT(bitmap_orthogonal_intersect(user_id, tag, 13080800, 11110200)) from user_tag_bitmap where tag in (13080800, 11110200); + +``` + +### 
bitmap_orthogonal_intersect_count +求交集count函数: + bitmap_orthogonal_intersect_count(bitmap_column, column_to_filter, filter_values) + +参数: + 第一个参数是Bitmap列,第二个参数是用来过滤的维度列,第三个参数开始是变长参数,含义是过滤维度列的不同取值 + +说明: + 此udaf定义同原版intersect_count,但实现不同。 + 此udaf,在此表schema的基础上,查询规划聚合上分2层,在第一层be节点(update、serialize)先按filter_values为key进行hash聚合,然后对所有key的bitmap求交集,再对交集结果求count,count值序列化后发送至第二层be节点(merge、finalize),在第二层be节点对所有来源于第一层节点的count值循环求sum + +定义: +``` +drop FUNCTION bitmap_orthogonal_intersect_count(BITMAP,BIGINT,BIGINT, ...); +CREATE AGGREGATE FUNCTION bitmap_orthogonal_intersect_count(BITMAP,BIGINT,BIGINT, ...) RETURNS BIGINT INTERMEDIATE varchar(1) +PROPERTIES ( +"init_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions27bitmap_intersect_count_initIlNS_9BigIntValEEEvPNS_15FunctionContextEPNS_9StringValE", +"update_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions23bitmap_intersect_updateIlNS_9BigIntValEEEvPNS_15FunctionContextERKNS_9StringValERKT0_iPS9_PS6_", +"serialize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions32bitmap_intersect_count_serializeIlEENS_9StringValEPNS_15FunctionContextERKS2_", +"merge_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions18bitmap_count_mergeEPNS_15FunctionContextERKNS_9StringValEPS3_", +"finalize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions21bitmap_count_finalizeEPNS_15FunctionContextERKNS_9StringValE", +"object_file"="http://ip:port/libudaf_orthogonal_bitmap.so" ); +``` + +### bitmap_orthogonal_union_count +求并集count函数: + bitmap_orthogonal_union_count(bitmap_column) + +说明: + 此udaf定义同原版bitmap_union_count,但实现不同。 + 此udaf,在此表schema的基础上,查询规划上分2层,在第一层be节点(update、serialize)对所有bitmap求并集,再对并集的结果bitmap求count,count值序列化后发送至第二层be节点(merge、finalize),在第二层be节点对所有来源于第一层节点的count值循环求sum + +定义: +``` +drop FUNCTION bitmap_orthogonal_union_count(BITMAP); +CREATE AGGREGATE FUNCTION bitmap_orthogonal_union_count(BITMAP) RETURNS BIGINT INTERMEDIATE varchar(1) +PROPERTIES ( 
+"init_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions23bitmap_union_count_initEPNS_15FunctionContextEPNS_9StringValE",
+"update_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions12bitmap_unionEPNS_15FunctionContextERKNS_9StringValEPS3_",
+"serialize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions22bitmap_count_serializeEPNS_15FunctionContextERKNS_9StringValE",
+"merge_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions18bitmap_count_mergeEPNS_15FunctionContextERKNS_9StringValEPS3_",
+"finalize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions21bitmap_count_finalizeEPNS_15FunctionContextERKNS_9StringValE",
+"object_file"="http://ip:port/libudaf_orthogonal_bitmap.so" );
+```
+
+## Source code and compilation
+Source code:
+```
+contrib/udf/src/udaf_bitmap/
+|-- bitmap_value.h
+|-- CMakeLists.txt
+|-- custom_bitmap_function.cpp
+|-- custom_bitmap_function.h
+`-- string_value.h
+```
+Compile the UDAF:
+```
+$ cd contrib/udf
+$ sh build_udf.sh
+
+```
+Output directory of libudaf_bitmap.so:
+```
+output/contrib/udf/lib/udaf_bitmap/libudaf_bitmap.so
Review comment: The name seems to be libudaf_orthogonal_bitmap.so?
########## File path: docs/en/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md ##########
@@ -0,0 +1,239 @@
+---
+{
+ "title": "bitmap orthogonal calculation udaf",
+ "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. 
See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Bitmap orthogonal calculation udaf
+
+## Background
+
+The general-purpose bitmap aggregate functions in Doris perform poorly for intersection and union on bitmaps with large cardinality (above the 100 million level). Examining the bitmap aggregation logic on the back-end BE shows two main reasons. First, when the bitmap cardinality is large and the data size exceeds 1 GB, network and disk I/O take a relatively long time; second, after scanning, all back-end BE instances send their data to the top-level node for the intersection and union operations, which puts pressure on that single node and makes it the processing bottleneck.
+
+The solution is to add a hid column when creating the table, assign hid values by ranges of the bitmap column's values, and bucket evenly by hid. In this way, the bitmap data, divided by range, is evenly distributed across all back-end BE instances after aggregation. On top of this table schema, the UDAF is optimized so that every scan node takes part in the distributed orthogonal calculation and the top node only summarizes the results, which greatly improves computational efficiency.
+
+## Create table
+
+The table must use the aggregate model. The data type is bitmap, and the aggregation function is bitmap_union
+```
+CREATE TABLE `user_tag_bitmap` (
+ `tag` bigint(20) NULL COMMENT "user tag",
+ `hid` smallint(6) NULL COMMENT "Bucket ID",
+ `user_id` bitmap BITMAP_UNION NULL COMMENT ""
+) ENGINE=OLAP
+AGGREGATE KEY(`tag`, `hid`)
+COMMENT "OLAP"
+DISTRIBUTED BY HASH(`hid`) BUCKETS 3
+```
+The hid column is added to the table schema to identify the ID range and serves as the hash bucket column.
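To illustrate why range-partitioning by hid lets every node work independently, here is a minimal Python sketch. It is not Doris code: plain Python sets stand in for bitmaps, the range width 5000000 is taken from the `hid = ceil(tmp_user_id/5000000)` expression in the Data Load section of this document, and the helper names (`hid_of`, `partition_by_hid`, `orthogonal_intersect_count`, `global_intersect_count`) and the sample rows are hypothetical.

```python
from collections import defaultdict

HID_RANGE = 5_000_000  # range width assumed from `hid = ceil(tmp_user_id/5000000)`


def hid_of(user_id: int) -> int:
    """Integer form of ceil(user_id / HID_RANGE) for positive user ids."""
    return (user_id + HID_RANGE - 1) // HID_RANGE


def partition_by_hid(rows):
    """Group (tag, user_id) rows into {hid: {tag: set_of_user_ids}}.

    The inner sets play the role of the BITMAP_UNION value per (tag, hid).
    """
    parts = defaultdict(lambda: defaultdict(set))
    for tag, user_id in rows:
        parts[hid_of(user_id)][tag].add(user_id)
    return parts


def orthogonal_intersect_count(rows, tags):
    """Two-level count: intersect inside each hid range, then sum the counts.

    `tags` must be non-empty.
    """
    total = 0
    for by_tag in partition_by_hid(rows).values():
        # First level: intersect the per-tag sets of one hid range locally.
        local = set.intersection(*(by_tag.get(t, set()) for t in tags))
        # Second level: only the small partial counts are added up.
        total += len(local)
    return total


def global_intersect_count(rows, tags):
    """Reference result: intersect the complete per-tag sets in one place."""
    by_tag = defaultdict(set)
    for tag, user_id in rows:
        by_tag[tag].add(user_id)
    return len(set.intersection(*(by_tag.get(t, set()) for t in tags)))


# Hypothetical sample data: (tag, user_id) pairs spread over three hid ranges.
ROWS = [
    (13080800, 1), (13080800, 2), (13080800, 5_000_001), (13080800, 7_000_000),
    (11110200, 2), (11110200, 3), (11110200, 5_000_001), (11110200, 10_000_001),
]
TAGS = [13080800, 11110200]

if __name__ == "__main__":
    print(orthogonal_intersect_count(ROWS, TAGS), global_intersect_count(ROWS, TAGS))
```

Because the hid ranges are disjoint, each user ID lives in exactly one partition, so summing the per-range intersection counts gives the same answer as intersecting the full sets. For the sample rows both functions return 2.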
+
+Note: choose the number of hid values and the number of buckets sensibly. The number of hid values should be at least 5 times the number of buckets so that the data is hashed into buckets as evenly as possible
+
+
+## Data Load
+
+```
+LOAD LABEL user_tag_bitmap_test
+(
+DATA INFILE('hdfs://abc')
+INTO TABLE user_tag_bitmap
+COLUMNS TERMINATED BY ','
+(tmp_tag, tmp_user_id)
+SET (
+tag = tmp_tag,
+hid = ceil(tmp_user_id/5000000),
+user_id = to_bitmap(tmp_user_id)
+)
+)
+...
+```
+
+Data format:
+
+```
+11111111,1
+11111112,2
+11111113,3
+11111114,4
+...
+```
+
+Note: the first column is the user tag, such as 'male', 'post-90s', '100000-200000', etc., which has been converted from Chinese into a number
+
+When the data is loaded, the user bitmap is split vertically by range. For example, user IDs in the range 1-5000000 share the same hid, and rows with the same hid are evenly hash-distributed to the back-end BE instances for union aggregation. The bitmap UDAF implementation takes advantage of tablets being evenly spread across the BEs: after scanning data on the local node, it computes the intersection or union directly, and the top node summarizes the results in the merge phase, which makes full use of the concurrent computing capacity of all BEs.
+
+## Custom udaf
+
+Set the following parameter before querying Doris
+
+```
+set parallel_fragment_exec_instance_num=5
+```
+
+Note: set the concurrency parameter according to the cluster's condition to improve concurrent computing performance
+
+The new UDAFs must register their function symbols when the aggregate functions are defined in Doris. The function symbols are loaded from a dynamic library (.so).
+
+### bitmap_orthogonal_intersect
+
+
+Orthogonal intersection function
+
+bitmap_orthogonal_intersect(bitmap_column, column_to_filter, filter_values)
+
+Parameters:
+
+The first parameter is the bitmap column, the second parameter is the dimension column used for filtering, and the parameters from the third onward are variable-length and give the values of the filter dimension column
+
+Explanation:
+
+Based on this table schema, the query plan for this UDAF aggregates in two levels.
In the first level, the BE nodes (update and serialize) first hash-aggregate with filter_values as the key and then intersect the bitmaps of all keys. The results are serialized and sent to the second-level BE nodes (merge and finalize), which union all the bitmap values received from the first-level nodes in a loop
+
+Definition:
+
+```
+drop FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...);
+CREATE AGGREGATE FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...) RETURNS BITMAP INTERMEDIATE varchar(1)
+PROPERTIES (
+"init_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions21bitmap_intersect_initIlNS_9BigIntValEEEvPNS_15FunctionContextEPNS_9StringValE",
+"update_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions23bitmap_intersect_updateIlNS_9BigIntValEEEvPNS_15FunctionContextERKNS_9StringValERKT0_iPS9_PS6_",
+"serialize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions30bitmap_intersect_and_serializeIlEENS_9StringValEPNS_15FunctionContextERKS2_",
+"merge_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions12bitmap_unionEPNS_15FunctionContextERKNS_9StringValEPS3_",
+"finalize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions16bitmap_serializeEPNS_15FunctionContextERKNS_9StringValE",
+"object_file"="http://ip:port/libudaf_orthogonal_bitmap.so" );
+
+```
+
+Note:
+
+1. column_to_filter and filter_values are set to the BIGINT type here;
+
+2. 
the function symbols can be looked up with nm /xxx/xxx/libudaf_bitmap.so|grep "bitmap_"
+
+Example:
+
+```
+select BITMAP_COUNT(bitmap_orthogonal_intersect(user_id, tag, 13080800, 11110200)) from user_tag_bitmap where tag in (13080800, 11110200);
+
+```
+
+### bitmap_orthogonal_intersect_count
+
+Intersection count function:
+
+bitmap_orthogonal_intersect_count(bitmap_column, column_to_filter, filter_values)
+
+Parameters:
+
+The first parameter is the bitmap column, the second parameter is the dimension column used for filtering, and the parameters from the third onward are variable-length and give the values of the filter dimension column
+
+Explanation:
+
+This UDAF has the same definition as the original intersect_count, but a different implementation.
+
+Based on this table schema, the query plan aggregates in two levels. In the first level, the BE nodes (update and serialize) first hash-aggregate with filter_values as the key, then intersect the bitmaps of all keys, and then count the intersection result. The count values are serialized and sent to the second-level BE nodes (merge and finalize), which sum all the count values received from the first-level nodes in a loop
+
+Definition:
+
+```
+drop FUNCTION bitmap_orthogonal_intersect_count(BITMAP,BIGINT,BIGINT, ...);
+CREATE AGGREGATE FUNCTION bitmap_orthogonal_intersect_count(BITMAP,BIGINT,BIGINT, ...) 
RETURNS BIGINT INTERMEDIATE varchar(1)
+PROPERTIES (
+"init_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions27bitmap_intersect_count_initIlNS_9BigIntValEEEvPNS_15FunctionContextEPNS_9StringValE",
+"update_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions23bitmap_intersect_updateIlNS_9BigIntValEEEvPNS_15FunctionContextERKNS_9StringValERKT0_iPS9_PS6_",
+"serialize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions32bitmap_intersect_count_serializeIlEENS_9StringValEPNS_15FunctionContextERKS2_",
+"merge_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions18bitmap_count_mergeEPNS_15FunctionContextERKNS_9StringValEPS3_",
+"finalize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions21bitmap_count_finalizeEPNS_15FunctionContextERKNS_9StringValE",
+"object_file"="http://ip:port/libudaf_orthogonal_bitmap.so" );
+```
+
+### bitmap_orthogonal_union_count
+
+Union count function:
+
+bitmap_orthogonal_union_count(bitmap_column)
+
+Explanation:
+
+This UDAF has the same definition as the original bitmap_union_count, but a different implementation.
+
+Based on this table schema, the query plan for this UDAF is divided into two levels. In the first level, the BE nodes (update and serialize) union all the bitmaps and then count the resulting bitmap. The count values are serialized and sent to the second-level BE nodes (merge and finalize).
In the second level, the BE nodes sum all the count values received from the first-level nodes
+
+Definition:
+
+```
+drop FUNCTION bitmap_orthogonal_union_count(BITMAP);
+CREATE AGGREGATE FUNCTION bitmap_orthogonal_union_count(BITMAP) RETURNS BIGINT INTERMEDIATE varchar(1)
+PROPERTIES (
+"init_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions23bitmap_union_count_initEPNS_15FunctionContextEPNS_9StringValE",
+"update_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions12bitmap_unionEPNS_15FunctionContextERKNS_9StringValEPS3_",
+"serialize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions22bitmap_count_serializeEPNS_15FunctionContextERKNS_9StringValE",
+"merge_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions18bitmap_count_mergeEPNS_15FunctionContextERKNS_9StringValEPS3_",
+"finalize_fn"="_ZN9doris_udf25OrthogonalBitmapFunctions21bitmap_count_finalizeEPNS_15FunctionContextERKNS_9StringValE",
+"object_file"="http://ip:port/libudaf_orthogonal_bitmap.so" );
+```
+
+## Source code and compilation
+
+Source code:
+
+```
+contrib/udf/src/udaf_bitmap/
+|-- bitmap_value.h
+|-- CMakeLists.txt
+|-- custom_bitmap_function.cpp
Review comment:
```suggestion
|-- orthogonal_bitmap_function.cpp
```