Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

via GitHub Mon, 02 Dec 2024 14:13:56 -0800


azagniotov commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-2513057933


   @mocobeta @johtani  Hello!
   
   I wanted to follow up in the context of this PR. I am observing an 
interesting issue when creating a tokenizer with:
   - `discardCompoundToken` set to `false`
   - `Mode.SEARCH`
   - `*.dat` files generated from 
[unidic-cwj-202302_full](https://clrd.ninjal.ac.jp/unidic_archive/2302/)
   
   I no longer see the compound token being emitted, only the short units. 
For example, running a unit test against each set of `*.dat` files gives the 
following tokenization results:
   - Default MeCab: `関西国際空港` => `"関西", "関西国際空港", "国際", "空港"`
   - unidic-cwj-202302_full: `関西国際空港` => `"関西", "国際", "空港"`
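
To make the difference between the two outputs checkable in a unit test, here is a minimal sketch that flags a token as "compound" when its surface can be spelled as a concatenation of other tokens in the same output. It uses only the JDK and does not call Lucene; the class name `CompoundTokenCheck` and the method `findCompounds` are hypothetical helpers invented for illustration, and the token lists are the two outputs quoted above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: inspects token surfaces only.
final class CompoundTokenCheck {

    /** Returns the tokens whose surface is a concatenation of other tokens in the list. */
    static List<String> findCompounds(List<String> tokens) {
        Set<String> surfaces = new HashSet<>(tokens);
        List<String> compounds = new ArrayList<>();
        for (String token : tokens) {
            if (isConcatenationOfOthers(token, surfaces)) {
                compounds.add(token);
            }
        }
        return compounds;
    }

    // Word-break style check: reachable[i] is true when the first i chars
    // of the token can be covered by surfaces taken from the set.
    private static boolean isConcatenationOfOthers(String token, Set<String> surfaces) {
        int n = token.length();
        boolean[] reachable = new boolean[n + 1];
        reachable[0] = true;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                // Skip the span covering the whole token so it cannot match itself.
                if (start == 0 && end == n) {
                    continue;
                }
                if (reachable[start] && surfaces.contains(token.substring(start, end))) {
                    reachable[end] = true;
                    break;
                }
            }
        }
        return reachable[n];
    }

    public static void main(String[] args) {
        // Token lists copied from the two results reported above.
        List<String> defaultDict = List.of("関西", "関西国際空港", "国際", "空港");
        List<String> unidic = List.of("関西", "国際", "空港");
        System.out.println("default dictionary compounds: " + findCompounds(defaultDict));
        System.out.println("UniDic compounds: " + findCompounds(unidic));
    }
}
```

With the lists above, the check reports `関西国際空港` for the default dictionary output and nothing for the UniDic output, which is the discrepancy described here.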
   
   
   Is the compound-token emitting behavior specific to the MeCab dictionary 
in the Kuromoji / Viterbi implementation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


