wangbo opened a new issue, #11032: URL: https://github.com/apache/doris/issues/11032
### Search before asking - [X] I had searched in the [issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and found no similar issues. ### Description I found in the case which count distinct one String Column, PHHashMap's performance is worse than non-vectorization version. # How to reproduce 1 create a table named lineorder_flat_string; ``` Note that LO_ORDERKEY is defined as varchar. CREATE TABLE `lineorder_flat_string` ( `LO_ORDERDATE` date NOT NULL COMMENT "", `LO_ORDERKEY` varchar(500) NOT NULL COMMENT "", `LO_LINENUMBER` tinyint(4) NOT NULL COMMENT "", `LO_CUSTKEY` int(11) NOT NULL COMMENT "", `LO_PARTKEY` int(11) NOT NULL COMMENT "", `LO_SUPPKEY` int(11) NOT NULL COMMENT "", `LO_ORDERPRIORITY` varchar(100) NOT NULL COMMENT "", `LO_SHIPPRIORITY` tinyint(4) NOT NULL COMMENT "", `LO_QUANTITY` tinyint(4) NOT NULL COMMENT "", `LO_EXTENDEDPRICE` int(11) NOT NULL COMMENT "", `LO_ORDTOTALPRICE` int(11) NOT NULL COMMENT "", `LO_DISCOUNT` tinyint(4) NOT NULL COMMENT "", `LO_REVENUE` int(11) NOT NULL COMMENT "", `LO_SUPPLYCOST` int(11) NOT NULL COMMENT "", `LO_TAX` tinyint(4) NOT NULL COMMENT "", `LO_COMMITDATE` date NOT NULL COMMENT "", `LO_SHIPMODE` varchar(100) NOT NULL COMMENT "", `C_NAME` varchar(100) NOT NULL COMMENT "", `C_ADDRESS` varchar(100) NOT NULL COMMENT "", `C_CITY` varchar(100) NOT NULL COMMENT "", `C_NATION` varchar(100) NOT NULL COMMENT "", `C_REGION` varchar(100) NOT NULL COMMENT "", `C_PHONE` varchar(100) NOT NULL COMMENT "", `C_MKTSEGMENT` varchar(100) NOT NULL COMMENT "", `S_NAME` varchar(100) NOT NULL COMMENT "", `S_ADDRESS` varchar(100) NOT NULL COMMENT "", `S_CITY` varchar(100) NOT NULL COMMENT "", `S_NATION` varchar(100) NOT NULL COMMENT "", `S_REGION` varchar(100) NOT NULL COMMENT "", `S_PHONE` varchar(100) NOT NULL COMMENT "", `P_NAME` varchar(100) NOT NULL COMMENT "", `P_MFGR` varchar(100) NOT NULL COMMENT "", `P_CATEGORY` varchar(100) NOT NULL COMMENT "", `P_BRAND` varchar(100) NOT NULL COMMENT "", `P_COLOR` varchar(100) NOT NULL COMMENT "", `P_TYPE` varchar(100) NOT NULL COMMENT "", `P_SIZE` tinyint(4) NOT NULL COMMENT "", `P_CONTAINER` varchar(100) NOT NULL COMMENT "" ) ENGINE=OLAP DUPLICATE KEY(`LO_ORDERDATE`, `LO_ORDERKEY`) COMMENT "OLAP" PARTITION BY RANGE(`LO_ORDERDATE`) (PARTITION p1 VALUES [('0000-01-01'), ('1993-01-01')), PARTITION p2 VALUES [('1993-01-01'), ('1994-01-01')), PARTITION p3 VALUES [('1994-01-01'), ('1995-01-01')), PARTITION p4 VALUES [('1995-01-01'), ('1996-01-01')), PARTITION p5 VALUES [('1996-01-01'), ('1997-01-01')), PARTITION p6 VALUES [('1997-01-01'), ('1998-01-01')), PARTITION p7 VALUES [('1998-01-01'), ('1999-01-01'))) DISTRIBUTED BY HASH(`LO_ORDERKEY`) BUCKETS 48 PROPERTIES ( "replication_allocation" = "tag.location.default: 1", "in_memory" = "false", "storage_format" = "V2" ) insert into lineorder_flat_string select * from lineorder_flat; ``` 2 Query Test Two BEs, 125G memory, 48 logic cores. Table rows is 600 million. 2.1 non-vectorization ``` mysql> set enable_vectorized_engine = false; Query OK, 0 rows affected (0.00 sec) mysql> select count(distinct LO_ORDERKEY) from lineorder_flat_string; +-------------------------------+ | count(DISTINCT `LO_ORDERKEY`) | +-------------------------------+ | 150000000 | +-------------------------------+ 1 row in set (1 min 10.94 sec) ``` 2.2 using flat_hash_map ``` mysql> select count(distinct LO_ORDERKEY) from lineorder_flat_string; +-------------------------------+ | count(DISTINCT `LO_ORDERKEY`) | +-------------------------------+ | 150000000 | +-------------------------------+ 1 row in set (2 min 23.54 sec) ``` 2.3 using parallel_flat_hash_map ``` mysql> select count(distinct LO_ORDERKEY) from lineorder_flat_string; +-------------------------------+ | count(DISTINCT `LO_ORDERKEY`) | +-------------------------------+ | 150000000 | +-------------------------------+ 1 row in set (57.97 sec) ``` ### Solution _No response_ ### Are you willing to submit PR? - [X] Yes I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org