airborne12 opened a new pull request, #284: URL: https://github.com/apache/doris-thirdparty/pull/284
… This pull request includes several changes to improve the handling of UTF-8 encoding in the `CLucene` library and adds new tests to ensure the correctness of these changes. The most important changes include modifications to the `IndexInput` and `IndexOutput` classes to handle UTF-8 encoding more accurately and the addition of new test cases for UTF-8 characters.

### Improvements to UTF-8 encoding handling:

* [`src/core/CLucene/store/IndexInput.cpp`](diffhunk://#diff-67ecca0c03c369fefa9a51e2f56262efc49a687e097d57ebeab1e78eeb869d72L138-R150): Modified the handling of byte sequences to differentiate between incorrect and correct UTF-8 encoding, providing a temporary solution to handle 4-byte characters.
* [`src/core/CLucene/store/IndexOutput.cpp`](diffhunk://#diff-64611e13ecbcf9b6e9b84e045e7bf35be98a9da95c04b7a71e075399a60ec888L179-R183): Updated the writing of byte sequences for 4-byte characters to differentiate between incorrect and correct UTF-8 encoding, providing a temporary solution. [[1]](diffhunk://#diff-64611e13ecbcf9b6e9b84e045e7bf35be98a9da95c04b7a71e075399a60ec888L179-R183) [[2]](diffhunk://#diff-64611e13ecbcf9b6e9b84e045e7bf35be98a9da95c04b7a71e075399a60ec888L216-R224)

### Addition of new test cases:

* [`src/test/CMakeLists.txt`](diffhunk://#diff-921b2054f6bf380eb08d5c3c21cf8d1c7cfee3736227d611400ae1a13ab3d187R113): Added `TestUTF8Chars.cpp` to the list of test files to be compiled.
* [`src/test/test.h`](diffhunk://#diff-993fc9d73840fa074470653e9f8a1e53afc4388b8bc671cd28ecbcfbea8b97b1R93): Declared the `testUTF8CharsSuite` function to include the new UTF-8 character tests.
* [`src/test/tests.cpp`](diffhunk://#diff-f21ef3314c226873fefc19da14a6e0561cbac2d2ec0b7ef8eb022d4edc2a25a1L9-R9): Added `TestUTF8Chars` to the list of unit tests to be executed.
[[1]](diffhunk://#diff-f21ef3314c226873fefc19da14a6e0561cbac2d2ec0b7ef8eb022d4edc2a25a1L9-R9) [[2]](diffhunk://#diff-f21ef3314c226873fefc19da14a6e0561cbac2d2ec0b7ef8eb022d4edc2a25a1R26)
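
For readers unfamiliar with the 4-byte case mentioned in the description, the following is a minimal sketch of standard UTF-8 encoding and decoding, in which code points from U+10000 to U+10FFFF take the form `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`. It is not the code from this pull request and does not use the `IndexInput` or `IndexOutput` API. The helper names `appendUtf8` and `decodeUtf8` are hypothetical, and the sketch does not attempt to reproduce whatever legacy ("incorrect") byte layout the PR distinguishes from the correct one.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of CLucene): append the standard UTF-8
// encoding of code point `cp` to `out`.
void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        // Surrogate code points are not valid Unicode scalar values.
        throw std::invalid_argument("surrogate code point");
    }
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        // 4-byte form: 3 + 6 + 6 + 6 = 21 payload bits.
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        throw std::invalid_argument("code point above U+10FFFF");
    }
}

// Hypothetical helper (not part of CLucene): decode one code point starting
// at `pos` and advance `pos` past it. This sketch validates lead and
// continuation bytes but does not reject overlong encodings.
std::uint32_t decodeUtf8(const std::string& in, std::size_t& pos) {
    auto byteAt = [&](std::size_t i) -> std::uint32_t {
        if (i >= in.size()) throw std::out_of_range("truncated UTF-8 sequence");
        return static_cast<unsigned char>(in[i]);
    };
    auto cont = [&](std::size_t i) -> std::uint32_t {
        std::uint32_t b = byteAt(i);
        if ((b & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
        return b & 0x3F;
    };

    std::uint32_t b0 = byteAt(pos);
    std::uint32_t cp = 0;
    std::size_t len = 0;
    if (b0 < 0x80) {
        cp = b0;
        len = 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        cp = ((b0 & 0x1F) << 6) | cont(pos + 1);
        len = 2;
    } else if ((b0 & 0xF0) == 0xE0) {
        cp = ((b0 & 0x0F) << 12) | (cont(pos + 1) << 6) | cont(pos + 2);
        len = 3;
    } else if ((b0 & 0xF8) == 0xF0) {
        cp = ((b0 & 0x07) << 18) | (cont(pos + 1) << 12) |
             (cont(pos + 2) << 6) | cont(pos + 3);
        len = 4;
    } else {
        throw std::invalid_argument("invalid UTF-8 lead byte");
    }
    pos += len;
    return cp;
}
```

As a usage example, calling `appendUtf8(s, 0x1F600)` on an empty string yields the four bytes `F0 9F 98 80`, and `decodeUtf8(s, pos)` with `pos = 0` returns `0x1F600` and leaves `pos` at 4.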