xiaokang commented on PR #10169: URL: https://github.com/apache/doris/pull/10169#issuecomment-1159990261
Here is a detailed explanation of the test.

**test env**
- test data: 25GB text log, 110 million rows
- test table: test_table(ts varchar(30), log string)
- test SQL 1: select count() from belog_lz4fix where log like '%minidump%'; the result is 1
- test SQL 2: select count() from belog_lz4fix where log like '%vaggregation\_node.cpp%'; the result is 1201154
- be.conf: disable_storage_page_cache = true. This setting disables the Doris page cache so that the data is not all cached in memory and the test measures real decompression speed.

**test result**
- No.1: the original version, using std::search with the std::default_searcher algorithm and calling search once for each single row.
- No.2,3,4: change the search algorithm to std::boyer_moore_searcher, std::boyer_moore_horspool_searcher, and volnitsky respectively.
- No.5,6,7,8: correspond to No.1,2,3,4, but call search across multiple rows. This leverages the contiguous memory layout of the string column and benefits from long runs of non-matching rows.

**analysis**
- compare No.4 with No.1: replacing std::search with volnitsky gives a 1.51x speedup.
- compare No.5 with No.1: replacing single-row search with search across rows for std::search gives a 1.25x speedup.
- compare No.8 with No.4: replacing single-row search with search across rows for volnitsky increases the speedup from 1.51x to 2.04x.
- compare No.2,3 with No.1, and No.6,7 with No.5: there is little difference between std::default_searcher, std::boyer_moore_searcher, and std::boyer_moore_horspool_searcher.
- compare the two SQL statements: the more mismatches (or the fewer matches), the greater the speedup.

**conclusion**
For the test data and test case:
1. The volnitsky algorithm with search across rows is the best.
2. std::boyer_moore_horspool_searcher is better than std::default_searcher.
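To make the difference between searching row by row and searching across rows concrete, here is a minimal sketch. It is not the code from this PR: `StringColumn`, `count_per_row`, and `count_across_rows` are hypothetical names, the column layout (one contiguous byte buffer plus row offsets) is an assumption based on the description above, and it uses std::boyer_moore_horspool_searcher from the standard library rather than volnitsky.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a string column: all row bytes are stored back to
// back, offsets[i] is where row i starts, and offsets.back() == data.size().
struct StringColumn {
    std::string data;
    std::vector<std::size_t> offsets{0};

    void append(std::string_view row) {
        data.append(row);
        offsets.push_back(data.size());
    }
    std::size_t rows() const { return offsets.size() - 1; }
};

// Per-row strategy: one search call for every row.
std::size_t count_per_row(const StringColumn& col, std::string_view needle) {
    assert(!needle.empty());
    std::boyer_moore_horspool_searcher searcher(needle.data(),
                                                needle.data() + needle.size());
    std::size_t matches = 0;
    for (std::size_t i = 0; i < col.rows(); ++i) {
        const char* first = col.data.data() + col.offsets[i];
        const char* last = col.data.data() + col.offsets[i + 1];
        if (std::search(first, last, searcher) != last) ++matches;
    }
    return matches;
}

// Across-rows strategy: search the whole buffer, then map each hit back to its
// row. Long runs of non-matching rows are skipped by a single search call.
std::size_t count_across_rows(const StringColumn& col, std::string_view needle) {
    assert(!needle.empty());
    std::boyer_moore_horspool_searcher searcher(needle.data(),
                                                needle.data() + needle.size());
    const char* base = col.data.data();
    const char* end = base + col.data.size();
    const char* pos = base;
    std::size_t row = 0;      // lowest row index that can still contain a hit
    std::size_t matches = 0;
    while (pos < end) {
        const char* hit = std::search(pos, end, searcher);
        if (hit == end) break;
        const std::size_t hit_off = static_cast<std::size_t>(hit - base);
        // The row containing the hit is the one before the first offset > hit_off.
        auto it = std::upper_bound(col.offsets.begin() + row, col.offsets.end(), hit_off);
        row = static_cast<std::size_t>(it - col.offsets.begin()) - 1;
        const std::size_t row_end = col.offsets[row + 1];
        if (hit_off + needle.size() <= row_end) {
            ++matches;             // the hit lies entirely inside one row
            pos = base + row_end;  // count each row once: skip the rest of it
            ++row;
        } else {
            pos = hit + 1;         // the hit straddles a row boundary: reject it
        }
    }
    return matches;
}
```

Both functions return the same count. The across-rows version has to reject hits that span a row boundary (for example a row ending in "mini" followed by a row starting with "dump"), which is the extra bookkeeping paid for making fewer search calls.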