airborne12 commented on code in PR #64667:
URL: https://github.com/apache/doris/pull/64667#discussion_r3489990093
##########
be/CMakeLists.txt:
##########
@@ -318,6 +318,12 @@ install(DIRECTORY
${BASE_DIR}/dict/pinyin
DESTINATION ${OUTPUT_DIR}/dict)
+# Japanese kuromoji dictionary
+install(DIRECTORY
+ ${BASE_DIR}/dict/kuromoji
+ DESTINATION ${OUTPUT_DIR}/dict
+ OPTIONAL)
Review Comment:
[Blocking] This install rule does not guarantee that the runtime Kuromoji
dictionary is present. The generated `dict/kuromoji/*.bin` files are
ignored/not committed, `kuromoji_build_dict` is `EXCLUDE_FROM_ALL`, and
`kuromoji_dict` is only a manual target. A default package can therefore ship
only `dict/kuromoji/README.md`, while the BE later loads
`${inverted_index_dict_path}/kuromoji`.
Please make package/install depend on dictionary generation and fail if
`system.bin`, `matrix.bin`, `chardef.bin`, and `unkdict.bin` are missing.
`OPTIONAL` should not hide a missing required analyzer artifact.
##########
be/src/storage/index/inverted/analyzer/kuromoji/KuromojiAnalyzer.h:
##########
@@ -0,0 +1,73 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+#include <string>
+
+#include "common/logging.h"
+#include "storage/index/inverted/analyzer/kuromoji/KuromojiTokenizer.h"
+#include "storage/index/inverted/analyzer/kuromoji/dict/kuromoji_dictionary.h"
+
+namespace doris::segment_v2 {
+
+class KuromojiAnalyzer : public Analyzer {
+public:
+ KuromojiAnalyzer() {
+ _lowercase = true;
+ _ownReader = false;
+ }
+ ~KuromojiAnalyzer() override = default;
+
+ bool isSDocOpt() override { return true; }
+
+ // Loads (once, process-wide) the IPADIC dictionary from `dictPath`. If it
is
+ // unavailable the tokenizer degrades to a per-codepoint split (logged),
rather
+ // than failing index/query.
+ void initDict(const std::string& dictPath) override {
Review Comment:
[Blocking] Missing or corrupt dictionary should not silently fall back for
the production `kuromoji` parser. If indexing runs with `dict_ == nullptr`,
segments are written with per-codepoint tokens; after the dictionary is
installed/reloaded, query analyzers can produce real Kuromoji tokens, so old
and new segments have different tokenization semantics.
Please fail analyzer creation/index/query with a clear error when the
Kuromoji dictionary cannot be loaded. If fallback tokenization is desired,
expose it as an explicit parser/mode persisted in index metadata.
##########
be/src/storage/index/inverted/inverted_index_parser.cpp:
##########
@@ -89,8 +93,13 @@ std::string get_parser_mode_string_from_properties(
if (parser_it == properties.end()) {
parser_it = properties.find(INVERTED_INDEX_PARSER_KEY_ALIAS);
}
- if (parser_it != properties.end() && parser_it->second ==
INVERTED_INDEX_PARSER_IK) {
- return INVERTED_INDEX_PARSER_SMART;
+ if (parser_it != properties.end()) {
+ if (parser_it->second == INVERTED_INDEX_PARSER_IK) {
+ return INVERTED_INDEX_PARSER_SMART;
+ }
+ if (parser_it->second == INVERTED_INDEX_PARSER_KUROMOJI) {
+ return INVERTED_INDEX_PARSER_KUROMOJI_SEARCH;
Review Comment:
[Major] BE now defaults omitted `parser_mode` for `parser=kuromoji` to
`search`, but FE `InvertedIndexProperties.getInvertedIndexParserMode()` still
only special-cases IK and otherwise returns `coarse_grained`. Match predicate
thrift serialization uses the FE helper, so FE and BE disagree on the effective
default.
Please update the FE default helper to return `search` for Kuromoji and add
FE coverage for a Kuromoji index/query without an explicit `parser_mode`.
##########
be/src/storage/index/inverted/analyzer/kuromoji/dict/kuromoji_dictionary.cpp:
##########
@@ -0,0 +1,202 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "storage/index/inverted/analyzer/kuromoji/dict/kuromoji_dictionary.h"
+
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <cstring>
+#include <map>
+#include <mutex>
+
+#include "common/logging.h"
+
+namespace doris::segment_v2::inverted_index::kuromoji {
+
+MappedFile::~MappedFile() {
+ if (_data != nullptr) {
+ ::munmap(const_cast<uint8_t*>(_data), _size);
+ _data = nullptr;
+ _size = 0;
+ }
+}
+
+Status MappedFile::open(const std::string& path) {
+ int fd = ::open(path.c_str(), O_RDONLY);
+ if (fd < 0) {
+ return Status::IOError("kuromoji dict: cannot open {}", path);
+ }
+ struct stat st {};
+ if (::fstat(fd, &st) != 0 || st.st_size <= 0) {
+ ::close(fd);
+ return Status::IOError("kuromoji dict: cannot stat {}", path);
+ }
+ auto bytes = static_cast<std::size_t>(st.st_size);
+ void* m = ::mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
+ ::close(fd);
+ if (m == MAP_FAILED) {
+ return Status::IOError("kuromoji dict: mmap failed for {}", path);
+ }
+ _data = static_cast<const uint8_t*>(m);
+ _size = bytes;
+ return Status::OK();
+}
+
+Status KuromojiDictionary::check_header(const uint8_t* p, std::size_t size,
KmjFileKind kind) {
+ if (size < sizeof(KmjFileHeader)) {
+ return Status::Corruption("kuromoji dict: file too small");
+ }
+ KmjFileHeader h {};
+ std::memcpy(&h, p, sizeof(h));
+ if (std::memcmp(h.magic, KMJ_MAGIC, sizeof(h.magic)) != 0) {
+ return Status::Corruption("kuromoji dict: bad magic");
+ }
+ if (h.format_version != KMJ_FORMAT_VERSION) {
+ return Status::Corruption("kuromoji dict: version {} != {}",
h.format_version,
+ KMJ_FORMAT_VERSION);
+ }
+ if (h.file_kind != static_cast<uint32_t>(kind)) {
+ return Status::Corruption("kuromoji dict: wrong file_kind {}",
h.file_kind);
+ }
+ if (h.file_size != size) {
+ return Status::Corruption("kuromoji dict: file_size {} != actual {}",
h.file_size, size);
+ }
+ return Status::OK();
+}
+
+std::string_view KuromojiDictionary::feature_at(const uint8_t* blob, uint64_t
blob_bytes,
+ uint32_t off) {
+ if (off == KMJ_NO_FEATURE || blob == nullptr || static_cast<uint64_t>(off)
+ 2 > blob_bytes) {
+ return {};
+ }
+ auto len = static_cast<uint16_t>(static_cast<uint16_t>(blob[off]) |
+ static_cast<uint16_t>(blob[off + 1] <<
8));
+ if (static_cast<uint64_t>(off) + 2 + len > blob_bytes) {
+ return {};
+ }
+ return {reinterpret_cast<const char*>(blob + off + 2), len};
+}
+
+Status KuromojiDictionary::map_system(const std::string& path) {
+ RETURN_IF_ERROR(_system_map.open(path));
+ const uint8_t* p = _system_map.data();
+ RETURN_IF_ERROR(check_header(p, _system_map.size(), KMJ_KIND_SYSTEM));
+ KmjSystemHeader s {};
+ std::memcpy(&s, p + sizeof(KmjFileHeader), sizeof(s));
Review Comment:
[Blocking] `check_header()` only validates the common header, but the loader
then trusts all sub-header offsets/counts and installs mmap pointers from them.
A header-valid but truncated/corrupt `*.bin` can make these `memcpy`/pointer
assignments and later accessors read past the mmap.
Please validate each sub-header before exposing pointers: `sizeof(header +
subheader)`, every `offset + count * sizeof(T)` range, integer overflow, trie
byte alignment, `class_count == CAT_CLASS_COUNT`, matrix dimensions/cell count,
run ranges into entry arrays, feature offsets, and word left/right IDs against
matrix bounds. Return `Status::Corruption` for invalid artifacts.
##########
be/src/storage/index/inverted/analyzer/kuromoji/KuromojiMode.h:
##########
@@ -0,0 +1,41 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <string>
+
+namespace doris::segment_v2 {
+
+// Segmentation mode, mirroring Lucene's JapaneseTokenizer.Mode. Normal returns
+// the minimum-cost segmentation. Search additionally decomposes long compounds
+// into their shorter parts (via a length-based cost penalty) for better search
+// recall. Extended applies the Search penalty and also splits unknown
+// (out-of-vocabulary) words into per-character unigrams.
+enum class KuromojiMode { Normal, Search, Extended };
+
+inline KuromojiMode kuromoji_mode_from_string(const std::string& mode) {
+ if (mode == "normal") {
+ return KuromojiMode::Normal;
+ }
+ if (mode == "extended") {
+ return KuromojiMode::Extended;
+ }
+ return KuromojiMode::Search; // default (matches OpenSearch/Lucene)
Review Comment:
[Major] This silently maps any unknown Kuromoji mode to `Search`. That makes
`TOKENIZE(..., '"parser"="kuromoji","parser_mode"="bogus"')` accepted and
executed as `search`, while index DDL rejects the same value.
Please make mode parsing return an error/status for unknown values, and
apply the same Kuromoji property validation to `TOKENIZE` as DDL/index creation.
##########
regression-test/suites/inverted_index_p0/analyzer/test_japanese_analyzer.groovy:
##########
@@ -0,0 +1,76 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+suite("test_japanese_analyzer", "p0") {
+ def tableName = "test_japanese_analyzer"
+
+ def backendId_to_backendIP = [:]
+ def backendId_to_backendHttpPort = [:]
+ getBackendIpHttpPort(backendId_to_backendIP, backendId_to_backendHttpPort)
+ def set_be_config = { key, value ->
+ for (String backend_id : backendId_to_backendIP.keySet()) {
+ update_be_config(backendId_to_backendIP.get(backend_id),
+ backendId_to_backendHttpPort.get(backend_id),
key, value)
+ }
+ }
+
+ sql "DROP TABLE IF EXISTS ${tableName}"
+ // kuromoji is disabled by default; enable it for this test.
+ set_be_config("enable_kuromoji_analyzer", "true")
+ try {
+ sql """
+ CREATE TABLE ${tableName} (
+ `id` int(11) NULL COMMENT "",
+ `content` text NULL COMMENT "",
+ INDEX content_idx (`content`) USING INVERTED PROPERTIES("parser" =
"kuromoji", "parser_mode" = "search") COMMENT '',
+ ) ENGINE=OLAP
+ DUPLICATE KEY(`id`)
+ COMMENT "OLAP"
+ DISTRIBUTED BY RANDOM BUCKETS 1
+ PROPERTIES (
+ "replication_allocation" = "tag.location.default: 1"
+ );
+ """
+
+ sql """ INSERT INTO ${tableName} VALUES (1, "東京都に住んでいます"); """
+ sql """ INSERT INTO ${tableName} VALUES (2, "私は寿司が好きです"); """
+ sql """ INSERT INTO ${tableName} VALUES (3, "Apache Doris は高速です"); """
+ sql "sync"
+
+ // The kuromoji dictionary is not shipped in the p0 package, so the
Review Comment:
[Blocking] This regression is proving the fallback path, not the feature
added by this PR. Because the p0 package does not ship the dictionary, CI can
pass while the real Kuromoji/IPADIC morphology path is never exercised.
Please add a required CI/regression path that generates/ships the dictionary
and asserts real morphology behavior, e.g. search-mode decompound (`東京都`
matching `東京`), base-form/POS behavior, and extended-mode unknown-word
splitting. Keep fallback coverage separate if fallback remains explicit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]