azagniotov commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-2513057933
@mocobeta @johtani Hello!

I wanted to touch base in the context of this PR. I am observing an interesting issue when creating a tokenizer with:
- `discardCompoundToken` set to `false`
- `Mode.SEARCH`
- `*.dat` files generated from [unidic-cwj-202302_full](https://clrd.ninjal.ac.jp/unidic_archive/2302/)

I no longer see the compound token being emitted, only the short units. For example, running a unit test against each set of `*.dat` files gives the following tokenization results:
- Default MeCab: `関西国際空港` => `"関西", "関西国際空港", "国際", "空港"`
- unidic-cwj-202302_full: `関西国際空港` => `"関西", "国際", "空港"`

Is the compound-token emitting behavior specific to the MeCab dictionary in the Kuromoji / Viterbi implementation?