liujiwen-up commented on code in PR #41681: URL: https://github.com/apache/doris/pull/41681#discussion_r1800408496
##########
be/src/vec/functions/function_string.cpp:
##########
@@ -535,6 +545,42 @@ struct TrimUtil {
         return Status::OK();
     }
 };
+template <bool is_ltrim_in, bool is_rtrim_in, bool trim_single>
+struct TrimInUtil {
+    static Status vector(const ColumnString::Chars& str_data,
+                         const ColumnString::Offsets& str_offsets, const StringRef& remove_str,
+                         ColumnString::Chars& res_data, ColumnString::Offsets& res_offsets) {
+        const size_t offset_size = str_offsets.size();
+        res_offsets.resize(offset_size);
+        res_data.reserve(str_data.size());
+        std::bitset<256> char_lookup;

Review Comment:
There are currently two processing paths:

1. When `simd::VStringFunctions::is_ascii` reports that the string is entirely ASCII, a `bitset<128>` is used. Standard ASCII covers the values 0 to 127, so 128 bits are exactly enough to represent every standard ASCII character.
2. When the string is not entirely standard ASCII, the UTF-8 logic is used. Right trimming needs particular care here: a UTF-8 trailing byte has the form `10xxxxxx`, so `(*prev_char_pos & 0xC0) == 0x80` tests whether the current byte is a trailing byte. Scanning backwards from the end, the first byte that is not a trailing byte is the starting byte of the current character.
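
For illustration only, the two paths described in the comment can be sketched in standalone C++ roughly as below. This is not the code in the PR: `is_all_ascii`, `utf8_char_len`, `contains_char`, `trim_in_ascii`, `trim_in_utf8` and `trim_in` are hypothetical names, `is_all_ascii` is a scalar stand-in for `simd::VStringFunctions::is_ascii`, and the sketch works on a single `std::string_view` rather than on `ColumnString` data. It also assumes the input is well-formed UTF-8.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical scalar stand-in for simd::VStringFunctions::is_ascii.
inline bool is_all_ascii(std::string_view s) {
    for (unsigned char c : s) {
        if (c >= 0x80) return false;
    }
    return true;
}

// Byte length of the UTF-8 character whose first byte is `c`.
inline std::size_t utf8_char_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1; // invalid lead byte: treat it as a single byte
}

// True if the UTF-8 character `ch` occurs in `remove` (compared as whole characters).
inline bool contains_char(std::string_view remove, std::string_view ch) {
    for (std::size_t i = 0; i < remove.size();) {
        std::size_t len = utf8_char_len(static_cast<unsigned char>(remove[i]));
        if (remove.substr(i, len) == ch) return true;
        i += len;
    }
    return false;
}

// Path 1: both inputs are pure ASCII, so a 128-bit lookup table is enough.
inline std::string_view trim_in_ascii(std::string_view str, std::string_view remove) {
    std::bitset<128> lookup;
    for (unsigned char c : remove) lookup.set(c);
    std::size_t begin = 0, end = str.size();
    while (begin < end && lookup.test(static_cast<unsigned char>(str[begin]))) ++begin;
    while (end > begin && lookup.test(static_cast<unsigned char>(str[end - 1]))) --end;
    return str.substr(begin, end - begin);
}

// Path 2: UTF-8 input, trimmed one whole character at a time.
inline std::string_view trim_in_utf8(std::string_view str, std::string_view remove) {
    std::size_t begin = 0, end = str.size();
    while (begin < end) {
        std::size_t len = std::min(utf8_char_len(static_cast<unsigned char>(str[begin])),
                                   end - begin);
        if (!contains_char(remove, str.substr(begin, len))) break;
        begin += len;
    }
    while (end > begin) {
        // Step back over trailing bytes (10xxxxxx) to the character's starting byte.
        std::size_t pos = end - 1;
        while (pos > begin && (static_cast<unsigned char>(str[pos]) & 0xC0) == 0x80) --pos;
        if (!contains_char(remove, str.substr(pos, end - pos))) break;
        end = pos;
    }
    return str.substr(begin, end - begin);
}

// Dispatch: take the bitset path only when every byte is standard ASCII.
inline std::string_view trim_in(std::string_view str, std::string_view remove) {
    if (is_all_ascii(str) && is_all_ascii(remove)) return trim_in_ascii(str, remove);
    return trim_in_utf8(str, remove);
}
```

Comparing whole characters in the UTF-8 path matters because different characters can share bytes. For example, the byte `0xBD` appears in both `你` (`E4 BD A0`) and `好` (`E5 A5 BD`), so a per-byte lookup could cut a character in half when trimming from the right.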