github-actions[bot] commented on code in PR #60892: URL: https://github.com/apache/doris/pull/60892#discussion_r2878195116
########## regression-test/suites/nereids_p0/sql_functions/string_functions/test_split_by_string_limit.groovy: ########## @@ -0,0 +1,93 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +suite("test_split_by_string_limit") { + sql "SET enable_nereids_planner=true" + sql "SET enable_fallback_to_original_planner=false" + + // === Constant expression tests with limit === + + // Basic limit functionality + qt_limit1 "select split_by_string('one,two,three,', ',', 2);" + qt_limit2 "select split_by_string('one,two,three,', ',', 3);" + qt_limit3 "select split_by_string('one,two,three,', ',', 4);" + qt_limit4 "select split_by_string('one,two,three,', ',', 10);" + qt_limit5 "select split_by_string('one,two,three', ',', 1);" + + // limit = -1 (no limit, same as 2-arg) + qt_limit6 "select split_by_string('one,two,three,', ',', -1);" + + // limit = 0 (no limit, same as 2-arg) + qt_limit7 "select split_by_string('a,b,c', ',', 0);" + + // Empty source string + limit + qt_limit8 "select split_by_string('', ',', 2);" + + // Empty delimiter + limit (split by character) + qt_limit9 "select split_by_string('abcde', '', 3);" + qt_limit10 "select split_by_string('abcde', '', 1);" + qt_limit11 "select split_by_string('abcde', '', 10);" + + // Multi-char delimiter + limit + qt_limit12 "select split_by_string('a::b::c::d', '::', 2);" + qt_limit13 "select split_by_string('a::b::c::d', '::', 3);" + qt_limit14 "select split_by_string('1,,2,3,,4,5,,abcde', ',,', 2);" + + // NULL handling + qt_limit15 "select split_by_string(NULL, ',', 2);" + + // UTF-8 + limit + qt_limit16 "select split_by_string('你a好b世c界', '', 3);" + + // Edge cases: consecutive delimiters + limit + qt_limit17 "select split_by_string(',,,', ',', 2);" + qt_limit18 "select split_by_string(',,a,b,c,', ',', 3);" + + // === Table data tests === + def tableName = "test_split_limit" + Review Comment: **[Low]** Per coding standard: "For ordinary single test tables, do not use `def tableName` form; instead hardcode your table name in all SQL." Please replace `${tableName}` with the literal `test_split_limit` in all SQL statements. ########## regression-test/suites/nereids_p0/sql_functions/string_functions/test_split_by_string_limit.groovy: ########## @@ -0,0 +1,93 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +suite("test_split_by_string_limit") { + sql "SET enable_nereids_planner=true" + sql "SET enable_fallback_to_original_planner=false" Review Comment: **[Low]** Per coding standard: "Except for variables you explicitly need to adjust for testing current functionality, other variables do not need extra setup before testing. For example, nereids optimizer and pipeline engine settings can use default states." These two SET statements are unnecessary since nereids is the default planner. Please remove lines 19-20. ########## be/src/vec/functions/function_string.h: ########## @@ -2163,44 +2158,142 @@ class FunctionSplitByString : public IFunction { } } - void split_empty_delimiter(const StringRef& str_ref, ColumnString::Chars& column_string_chars, - ColumnString::Offsets& column_string_offsets, - ColumnArray::Offset64& string_pos, - ColumnArray::Offset64& dest_pos) const { + static void split_empty_delimiter(const StringRef& str_ref, + ColumnString::Chars& column_string_chars, + ColumnString::Offsets& column_string_offsets, + ColumnArray::Offset64& string_pos, + ColumnArray::Offset64& dest_pos, Int32 limit_value) { const size_t old_size = column_string_chars.size(); const size_t new_size = old_size + str_ref.size; column_string_chars.resize(new_size); memcpy(column_string_chars.data() + old_size, str_ref.data, str_ref.size); - if (simd::VStringFunctions::is_ascii(str_ref)) { - const auto size = str_ref.size; - - const auto nested_old_size = column_string_offsets.size(); - const auto nested_new_size = nested_old_size + size; - column_string_offsets.resize(nested_new_size); - std::iota(column_string_offsets.data() + nested_old_size, - column_string_offsets.data() + nested_new_size, string_pos + 1); - - string_pos += size; - dest_pos += size; - // The above code is equivalent to the code in the following comment. - // for (size_t i = 0; i < str_ref.size; i++) { - // string_pos++; - // column_string_offsets.push_back(string_pos); - // (*dest_nested_null_map).push_back(false); - // dest_pos++; - // } + + if (limit_value > 0) { + // With limit: split character by character up to limit-1, then remainder + Int32 split_count = 0; + size_t i = 0; + if (simd::VStringFunctions::is_ascii(str_ref)) { + for (; i < str_ref.size; i++) { + if (split_count == limit_value - 1) { + // remainder + string_pos += str_ref.size - i; + column_string_offsets.push_back(string_pos); + dest_pos++; + return; + } + string_pos++; + column_string_offsets.push_back(string_pos); + dest_pos++; + split_count++; + } + } else { + for (size_t utf8_char_len = 0; i < str_ref.size; i += utf8_char_len) { + utf8_char_len = UTF8_BYTE_LENGTH[(unsigned char)str_ref.data[i]]; + if (split_count == limit_value - 1) { + // remainder + string_pos += str_ref.size - i; + column_string_offsets.push_back(string_pos); + dest_pos++; + return; + } + string_pos += utf8_char_len; + column_string_offsets.push_back(string_pos); + dest_pos++; + split_count++; + } + } } else { - for (size_t i = 0, utf8_char_len = 0; i < str_ref.size; i += utf8_char_len) { - utf8_char_len = UTF8_BYTE_LENGTH[(unsigned char)str_ref.data[i]]; + // No limit: original behavior + if (simd::VStringFunctions::is_ascii(str_ref)) { + const auto size = str_ref.size; + + const auto nested_old_size = column_string_offsets.size(); + const auto nested_new_size = nested_old_size + size; + column_string_offsets.resize(nested_new_size); + std::iota(column_string_offsets.data() + nested_old_size, + column_string_offsets.data() + nested_new_size, string_pos + 1); + + string_pos += size; + dest_pos += size; + } else { + for (size_t i = 0, utf8_char_len = 0; i < str_ref.size; i += utf8_char_len) { + utf8_char_len = UTF8_BYTE_LENGTH[(unsigned char)str_ref.data[i]]; - string_pos += utf8_char_len; - column_string_offsets.push_back(string_pos); - dest_pos++; + string_pos += utf8_char_len; + column_string_offsets.push_back(string_pos); + dest_pos++; + } } } } }; +struct SplitByStringTwoArgImpl { + static DataTypes get_variadic_argument_types() { + return {std::make_shared<DataTypeString>(), std::make_shared<DataTypeString>()}; + } + + static Status execute_impl(FunctionContext* /*context*/, Block& block, + const ColumnNumbers& arguments, uint32_t result, + size_t input_rows_count) { + DCHECK_EQ(arguments.size(), 2); + return SplitByStringExecutor::execute_core(block, arguments, result, input_rows_count, -1); + } +}; + +struct SplitByStringThreeArgImpl { + static DataTypes get_variadic_argument_types() { + return {std::make_shared<DataTypeString>(), std::make_shared<DataTypeString>(), + std::make_shared<DataTypeInt32>()}; + } + + static Status execute_impl(FunctionContext* /*context*/, Block& block, + const ColumnNumbers& arguments, uint32_t result, + size_t input_rows_count) { + DCHECK_EQ(arguments.size(), 3); + const auto& [limit_column, limit_is_const] = + unpack_if_const(block.get_by_position(arguments[2]).column); + auto limit_value = assert_cast<const ColumnInt32&>(*limit_column).get_element(0); + return SplitByStringExecutor::execute_core(block, arguments, result, input_rows_count, Review Comment: **[Medium]** `limit_is_const` is extracted from `unpack_if_const` but never checked. Per Doris coding standards ("assert correctness, never use defensive programming"), this should have a `DCHECK(limit_is_const)` assertion to catch cases where a non-constant limit column reaches the BE (e.g., via planner bugs or future code paths). Without it, if `limit_is_const` is `false`, every row silently uses the first row's limit value, producing incorrect results. Suggested fix: ```cpp const auto& [limit_column, limit_is_const] = unpack_if_const(block.get_by_position(arguments[2]).column); DCHECK(limit_is_const) << "limit parameter of split_by_string must be constant"; auto limit_value = assert_cast<const ColumnInt32&>(*limit_column).get_element(0); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
