Elbek Kamoliddinov created LUCENE-9100: ------------------------------------------
             Summary: JapaneseTokenizer produces inconsistent tokens
                 Key: LUCENE-9100
                 URL: https://issues.apache.org/jira/browse/LUCENE-9100
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 7.2
            Reporter: Elbek Kamoliddinov

We use {{JapaneseTokenizer}} in production and are seeing some inconsistent behavior. With this text:
{{"マギアリス【単版話】 4話 (Unlimited Comics)"}}
I get different results if I insert a space before the {{【}} character. Here is a small code snippet demonstrating the case (note that we use our own dictionary and connection costs):

{code:java}
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
    // `dictionaries` holds our own system dictionary, unknown dictionary and connection costs
    Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(),
        dictionaries.systemDictionary, dictionaries.unknownDictionary, dictionaries.connectionCosts,
        null, true, JapaneseTokenizer.Mode.SEARCH);
    return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
  }
};

String text1 = "マギアリス【単版話】 4話 (Unlimited Comics)";
String text2 = "マギアリス 【単版話】 4話 (Unlimited Comics)"; // inserted space

// shown for text1; run the same block with text2 to get the second output
try (TokenStream tokens = analyzer.tokenStream("field", new StringReader(text1))) {
  CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
  tokens.reset();
  while (tokens.incrementToken()) {
    System.out.println(chars.toString());
  }
  tokens.end();
} catch (IOException e) {
  // should never happen with a StringReader
  throw new RuntimeException(e);
}
{code}

Output is:
{code:java}
//text1
マギ
アリス
単
版
話
4
話
unlimited
comics

//text2
マギア
リス
単
版
話
4
話
unlimited
comics
{code}

It looks like the tokenizer doesn't treat punctuation ({{【}} is of type {{Character.START_PUNCTUATION}}) as an indicator that there should be a token break, and the {{【}} punctuation character somehow causes a difference in the output.
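The character type mentioned above can be checked with plain JDK calls. The following is a minimal sketch that does not involve Lucene at all. The class name {{PunctuationTypeCheck}} and the helper {{typeName}} are made up for illustration, and the check only confirms the Unicode category of each character; it says nothing about how the tokenizer treats these characters internally.

```java
public class PunctuationTypeCheck {

    // Maps the few Unicode general categories relevant to the sample text
    // to readable names; anything else is reported by its numeric type.
    static String typeName(int codePoint) {
        int type = Character.getType(codePoint);
        switch (type) {
            case Character.START_PUNCTUATION: return "START_PUNCTUATION";
            case Character.END_PUNCTUATION:   return "END_PUNCTUATION";
            case Character.SPACE_SEPARATOR:   return "SPACE_SEPARATOR";
            case Character.OTHER_LETTER:      return "OTHER_LETTER";
            default:                          return "OTHER(" + type + ")";
        }
    }

    public static void main(String[] args) {
        // First part of text2 from the report, including the inserted space.
        String text = "マギアリス 【単版話】";
        text.codePoints().forEach(cp ->
            System.out.println(new String(Character.toChars(cp)) + " -> " + typeName(cp)));
    }
}
```

The katakana and kanji come out as {{OTHER_LETTER}}, while {{【}} and {{】}} are start and end punctuation, so from the JDK's point of view there is a clear category change at the position where the two inputs differ.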
If I use the stock {{JapaneseTokenizer}} (the commented-out constructor, which uses the bundled dictionary), this problem doesn't manifest, because it doesn't split {{マギアリス}} into multiple tokens and outputs it as is.