kzh1458003655-web opened a new pull request, #59016:
URL: https://github.com/apache/doris/pull/59016
### What changes were made in this pull request?
Implement the Levenshtein distance function (`levenshtein(string_a,
string_b)`) in the Vectorized Engine.
This function computes the minimum number of single-character edits
(insertions, deletions, or substitutions) required to change one string into
the other.
### Why are these changes needed?
To provide users with a common string function used for fuzzy string
matching, data cleaning, and calculating string similarity.
### Implementation details
1. **UTF-8 Support:** The implementation converts the input strings (which
may contain multi-byte UTF-8 characters) into Unicode Code Points (UTF-32
integers) before performing the dynamic programming (DP) calculation. This
ensures that the distance is calculated based on the number of *characters*,
not the number of *bytes*.
2. **Vectorized Integration:**
* Implemented the logic within `LevenshteinImpl` in `function_string.h`.
* Used `std::string_view` as the input type for `execute` to align with
modern Doris versions.
* Registered the function using the `FunctionBinaryToType` alias
(`FunctionStringLevenshtein`) along with the `LevenshteinWrapper` adapter in
`function_string.cpp`.
3. **Tests:** Added a regression test
(`test_string_function_levenshtein.groovy`) covering standard inputs, boundary
cases (empty strings), NULL inputs, and multi-byte UTF-8 input (Chinese
characters).
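
The code-point approach from item 1 can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `decode_utf8` and `levenshtein_sketch` are hypothetical names, the decoder assumes well-formed UTF-8 (a stray or truncated byte is passed through as its own code point), and the Doris-side `LevenshteinImpl` may be structured differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Assumes well-formed input; a stray or truncated byte becomes its own code point.
std::vector<char32_t> decode_utf8(std::string_view s) {
    std::vector<char32_t> out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size();) {
        const auto lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead >= 0xF0) {
            len = 4;
            cp = lead & 0x07;
        } else if (lead >= 0xE0) {
            len = 3;
            cp = lead & 0x0F;
        } else if (lead >= 0xC0) {
            len = 2;
            cp = lead & 0x1F;
        }
        if (i + len > s.size()) {  // truncated sequence: fall back to one byte
            len = 1;
            cp = lead;
        }
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Two-row dynamic programming over code points, so the result counts
// character edits rather than byte edits.
int32_t levenshtein_sketch(std::string_view a, std::string_view b) {
    const std::vector<char32_t> x = decode_utf8(a);
    const std::vector<char32_t> y = decode_utf8(b);
    std::vector<int32_t> prev(y.size() + 1);
    std::vector<int32_t> curr(y.size() + 1);
    std::iota(prev.begin(), prev.end(), 0);  // distance from empty prefix of x
    for (std::size_t i = 1; i <= x.size(); ++i) {
        curr[0] = static_cast<int32_t>(i);
        for (std::size_t j = 1; j <= y.size(); ++j) {
            const int32_t cost = (x[i - 1] == y[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[y.size()];
}
```

With this sketch, two strings of two Chinese characters each that differ in the second character are at distance 1, whereas a byte-wise comparison would count the differing bytes of the three-byte sequence.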
### Note for Reviewers
* The function returns the distance as `INT`.
* The implementation stores the DP rows in `std::vector<int32_t>` and declares
them `thread_local` so that their memory is reused across calls.
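
For reviewers unfamiliar with the pattern, the `thread_local` row reuse might look something like the sketch below. It is an assumption about the general technique, not a copy of the PR's code: `levenshtein_reuse_sketch` is a hypothetical name, and it takes already-decoded code points to stay self-contained.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical sketch: the two DP rows live in thread_local vectors, so each
// thread keeps its own buffers and repeated calls reuse the allocated capacity.
int32_t levenshtein_reuse_sketch(std::u32string_view x, std::u32string_view y) {
    thread_local std::vector<int32_t> prev;
    thread_local std::vector<int32_t> curr;

    // resize() only allocates when the rows need to grow beyond their capacity.
    prev.resize(y.size() + 1);
    curr.resize(y.size() + 1);
    std::iota(prev.begin(), prev.end(), 0);  // reset state left by earlier calls

    for (std::size_t i = 1; i <= x.size(); ++i) {
        curr[0] = static_cast<int32_t>(i);
        for (std::size_t j = 1; j <= y.size(); ++j) {
            const int32_t cost = (x[i - 1] == y[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost});
        }
        std::swap(prev, curr);  // swaps buffers only; no reallocation
    }
    return prev[y.size()];
}
```

Every cell is rewritten before it is read, so stale values from a previous call do not leak into the result. The trade-off is that each thread retains buffers sized for the longest second argument it has seen.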