[GitHub] [doris] xiaokang commented on pull request #10169: optimize substr performance

GitBox Sun, 19 Jun 2022 22:29:33 -0700


xiaokang commented on PR #10169:
URL: https://github.com/apache/doris/pull/10169#issuecomment-1159990261


   Here is a more detailed explanation of the test.
   
   **test env**
   
   - test data: 25GB text log, 110 million rows
   - test table: test_table(ts varchar(30), log string)
   - test SQL 1: select count() from belog_lz4fix where log like '%minidump%'; the result is 1
   - test SQL 2: select count() from belog_lz4fix where log like '%vaggregation\_node.cpp%'; the result is 1201154
   - be.conf: disable_storage_page_cache = true
   This setting disables the Doris page cache so that the data is not all cached in memory and the test measures the real decompression speed.
   
   
   **test result**
   
   - No.1: the original version, using std::search with the std::default_searcher algorithm and calling search once for each row.
   - No.2,3,4: change the search algorithm to std::boyer_moore_searcher, std::boyer_moore_horspool_searcher, and volnitsky respectively.
   - No.5,6,7,8: correspond to No.1,2,3,4, but call search across multiple rows. This leverages the contiguous memory layout of the string column and benefits from long runs of mismatches.
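
To illustrate the difference between searching row by row and searching across rows, here is a minimal sketch. It is not the code in this PR: `StringColumn`, `count_per_row` and `count_across_rows` are hypothetical names, and the layout (all row bytes back to back plus a vector of end offsets) is a simplified assumption about a contiguous string column. It uses std::boyer_moore_horspool_searcher from C++17 as a stand-in for any of the searchers listed above.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical simplified string column: all row bytes are stored back to
// back in `chars`, and offsets[i] is the end position (exclusive) of row i.
struct StringColumn {
    std::string chars;
    std::vector<std::size_t> offsets;

    void append(std::string_view row) {
        chars.append(row);
        offsets.push_back(chars.size());
    }
};

// Per-row strategy: one search call for every row.
std::size_t count_per_row(const StringColumn& col, std::string_view needle) {
    std::boyer_moore_horspool_searcher searcher(needle.data(),
                                                needle.data() + needle.size());
    const char* base = col.chars.data();
    std::size_t count = 0;
    std::size_t begin = 0;
    for (std::size_t end : col.offsets) {
        if (std::search(base + begin, base + end, searcher) != base + end) {
            ++count;
        }
        begin = end;
    }
    return count;
}

// Cross-row strategy: search the whole buffer and map each hit back to its
// row, so long runs of non-matching rows cost a single search call.
std::size_t count_across_rows(const StringColumn& col, std::string_view needle) {
    if (needle.empty()) return col.offsets.size();  // empty pattern matches every row
    std::boyer_moore_horspool_searcher searcher(needle.data(),
                                                needle.data() + needle.size());
    const char* base = col.chars.data();
    const char* end = base + col.chars.size();
    const char* pos = base;
    std::size_t count = 0;
    while (pos < end) {
        const char* hit = std::search(pos, end, searcher);
        if (hit == end) break;
        const std::size_t hit_off = static_cast<std::size_t>(hit - base);
        // The first end offset greater than hit_off belongs to the row
        // that contains the first byte of the hit.
        const std::size_t row_end =
            *std::upper_bound(col.offsets.begin(), col.offsets.end(), hit_off);
        if (hit_off + needle.size() <= row_end) {
            ++count;
            pos = base + row_end;  // count each row once, skip the rest of it
        } else {
            pos = hit + 1;  // hit straddles a row boundary, so it is not a match
        }
    }
    return count;
}
```

Because this sketch has no separator between rows, a hit can straddle a row boundary, so it is checked and rejected explicitly. The fewer rows match, the fewer times the cross-row loop has to stop and restart.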
   
   **analysis**
   
   - compare No.4 with No.1: replacing std::search with volnitsky gives a 1.51x speedup.
   - compare No.5 with No.1: replacing single-row search with search across rows for std::search gives a 1.25x speedup.
   - compare No.8 with No.4: replacing single-row search with search across rows for volnitsky raises the speedup from 1.51x to 2.04x.
   - compare No.2,3 with No.1, and No.6,7 with No.5: there is little difference between std::default_searcher, std::boyer_moore_searcher, and std::boyer_moore_horspool_searcher.
   - compare the two SQL statements: the more mismatches (or the fewer matches), the greater the speedup.
   
   **conclusion**
   For this test data and these test cases:
   1. the volnitsky algorithm with search across rows is the best.
   2. std::boyer_moore_horspool_searcher is better than std::default_searcher.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


