azagniotov opened a new pull request, #12517:
URL: https://github.com/apache/lucene/pull/12517

   ## TL;DR
   
   The current PR attempts to remediate 
https://issues.apache.org/jira/browse/LUCENE-4056 (`Japanese Tokenizer 
(Kuromoji) cannot build UniDic dictionary`)
   
   ## Description
   
   
   This PR builds upon @johtani's past PR, https://github.com/apache/lucene-solr/pull/935, and attempts to bring the code into a mergeable state, or at least to re-ignite the conversation about building a UniDic dictionary.
   
   Before opening this PR, I verified the behavior that @johtani added in the aforementioned PR (plus some changes of my own) by successfully building the dictionaries listed below. I posted my findings in this comment: https://github.com/apache/lucene-solr/pull/935#issuecomment-1685887305
   
   ### Philosophy of changes
   
   I tried to pick up @johtani's added behavior as-is and to keep my own additional changes to a minimum.
   
   To be honest, I do not really like how `DictionaryBuilder.DictionaryFormat format` is being passed everywhere, as it creates small decision trees of the form `if IPADIC do this else do that`. If this PR gets merged, I would be happy to refactor the code to make it more elegant, perhaps with better separation of concerns. But I defer to the Lucene reviewers on this one. 🙇🏼‍♀️
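
As a rough illustration of that refactoring idea, the sketch below moves the per-format decision into the enum itself, so callers no longer branch on the format. Everything in it is hypothetical: the `DictionaryFormat` enum is a stand-in rather than the class touched by this PR, and the `baseFormColumn` indices are placeholders that do not reflect the real IPADIC or UniDic CSV layouts.

```java
import java.util.Locale;

public class FormatStrategySketch {

  /** Hypothetical stand-in for DictionaryBuilder.DictionaryFormat. */
  enum DictionaryFormat {
    // Placeholder column indices, NOT the real CSV layouts of either dictionary.
    IPADIC(2),
    UNIDIC(3);

    private final int baseFormColumn;

    DictionaryFormat(int baseFormColumn) {
      this.baseFormColumn = baseFormColumn;
    }

    /** Each format knows where its own base form lives, so callers need no if/else. */
    String baseForm(String[] columns) {
      return columns[baseFormColumn];
    }
  }

  /** Resolves the format by name (case-insensitive) and delegates to it. */
  static String baseFormOf(String format, String[] columns) {
    return DictionaryFormat.valueOf(format.toUpperCase(Locale.ROOT)).baseForm(columns);
  }

  public static void main(String[] args) {
    String[] row = {"surface", "pos", "baseA", "baseB"};
    System.out.println(baseFormOf("ipadic", row));
    System.out.println(baseFormOf("unidic", row));
  }
}
```

With this shape, adding another dictionary format would mean adding one enum constant instead of touching every call site that currently checks the format.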
   
   The commits are fairly atomic, and I hope this makes the review easier.
   
   ## Built Dictionaries 
   
   - [unidic-cwj-202302_full](https://clrd.ninjal.ac.jp/unidic_archive/2302/) 
(NINJAL)
   - [unidic-cwj-3.1.1-full](https://clrd.ninjal.ac.jp/unidic_archive/cwj/3.1.1/) (NINJAL) (see **Caveat** below 👇🏼)
   - 
[unidic-mecab-2.1.2_src](https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/) 
(NINJAL)
   - [mecab-ipadic-2.7.0-20070801](https://taku910.github.io/mecab/)
   
   ### Caveat
   RE: Building the 
[unidic-cwj-3.1.1-full](https://clrd.ninjal.ac.jp/unidic_archive/cwj/3.1.1/): 
   
   1. I had to increase the [length check of the baseForm](https://github.com/apache/lucene/commit/4bb6000cdc85aa4cc266d0a32a149d4d517d2e95) from `16` to `35`.
   2. I had to stop [throwing an exception for multiple entries of LeftID](https://github.com/apache/lucene/commit/011efbe480dce5eafc2c81cdbf44a3fc2f07b61c). To be honest, I do not understand the full ramifications of this change, so I would value the input of subject matter experts here. 🙇🏼‍♀️
   
   ## Building a dictionary
   
   ### Gradle command
   
   My build command leverages the new Gradle setup and follows the [DictionaryBuilder JavaDoc comment](https://github.com/apache/lucene/commit/963ddc66f6d822bf877c7e6155d903babf937c13) that describes how to build a dictionary.
   
   In `lucene/analysis/kuromoji/build.gradle`, I added a [default run task from Gradle's application plugin](https://github.com/apache/lucene/commit/8d52f66d57b354fad6426c7c02b05a3b736d7dcd), which allows building a dictionary as shown below. The command should be executed from the root directory `lucene`, where the `gradlew` file is.
   
   For example, the following is my command when building 
`unidic-cwj-202302_full` dictionary without NFKD normalization:
   ```
   ./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-202302_full" "/Users/azagniotov/Downloads/unidic-cwj-202302_full/lucene-kuromoji-built" "UTF-8" false'
   ```
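
To make the shape of that command easier to follow, here is a minimal sketch of how the five positional arguments could be read. It assumes they are, in order, the dictionary format, the input directory, the output directory, the input encoding, and the NFKD normalization flag. The class and method names (`BuildArgsSketch`, `parse`, `maybeNormalize`) are hypothetical and are not the actual `DictionaryBuilder` code. The only real API used is `java.text.Normalizer` from the JDK.

```java
import java.text.Normalizer;

public class BuildArgsSketch {

  /** Hypothetical holder for the five positional arguments of the build command. */
  static final class BuildArgs {
    final String format;
    final String inputDir;
    final String outputDir;
    final String encoding;
    final boolean normalize;

    BuildArgs(String format, String inputDir, String outputDir, String encoding, boolean normalize) {
      this.format = format;
      this.inputDir = inputDir;
      this.outputDir = outputDir;
      this.encoding = encoding;
      this.normalize = normalize;
    }
  }

  /** Reads the arguments in the assumed order; rejects any other argument count. */
  static BuildArgs parse(String[] args) {
    if (args.length != 5) {
      throw new IllegalArgumentException("expected 5 arguments, got " + args.length);
    }
    return new BuildArgs(args[0], args[1], args[2], args[3], Boolean.parseBoolean(args[4]));
  }

  /** Applies NFKD normalization only when the flag is set; otherwise returns the text unchanged. */
  static String maybeNormalize(String text, boolean normalize) {
    return normalize ? Normalizer.normalize(text, Normalizer.Form.NFKD) : text;
  }

  public static void main(String[] args) {
    BuildArgs parsed = parse(new String[] {"unidic", "in-dir", "out-dir", "UTF-8", "false"});
    String fullWidth = "\uFF21\uFF22\uFF23"; // full-width "ABC"
    // With the flag set to false, as in the command above, the text is left as-is.
    System.out.println(maybeNormalize(fullWidth, parsed.normalize));
    // With normalization on, full-width letters decompose to plain ASCII "ABC".
    System.out.println(maybeNormalize(fullWidth, true));
  }
}
```

Passing `false` as the last argument, as in the command above, would leave dictionary text untouched under these assumptions.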
   
   ## Unit testing
   
   Unfortunately, this PR does not include unit tests, because the built dictionary files are very big: for example, the built `unidic-cwj-202302_full` is around 700MB. Instead, the unit tests below were added and run locally on my machine to verify that the built dictionaries can be used at runtime.
   
   I did see that there is a [main/gradle/generation/kuromoji.gradle](https://github.com/apache/lucene/blob/main/gradle/generation/kuromoji.gradle) that downloads dictionaries and compiles them, but for now I chose not to add changes there, as I think a dictionary should be decoupled.
   
   The built dictionaries were tested using the following Japanese strings (no 
particular reason, I just picked these four strings): 
   
   - `"にじさんじ"`
   - `"ちいかわ"`
   - `"桃太郎電鉄"`
   - `"聖川真斗"`
   
   The metadata of each built dictionary was placed (one dictionary at a time) under [lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict](https://github.com/apache/lucene/tree/main/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict), and a few unit test cases were added:
   
   #### Existing default dictionary already included in Lucene
   ```
   assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
   assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});
   ```
   
   #### Built unidic-cwj-202302_full 
   
   I needed to increase memory before running the tests, as the 
`ConnectionCosts.dat` is ~700MB
   
   ```
   assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
   assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});
   ```
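
For readers less familiar with these assertions: the trailing `int[]` is, as far as I understand the test helper, the expected position increment of each token, so an increment of `0` means the token shares a position with the previous one. The sketch below is plain Java with no Lucene dependency, and `PositionIncrementSketch` and `toPositions` are hypothetical names. It converts increments into absolute positions to show how an expectation such as `{1, 0, 1, 1}` stacks two tokens at the same position.

```java
public class PositionIncrementSketch {

  /** Converts per-token position increments into zero-based absolute positions. */
  static int[] toPositions(int[] increments) {
    int[] positions = new int[increments.length];
    int position = -1; // the first increment of 1 lands on position 0
    for (int i = 0; i < increments.length; i++) {
      position += increments[i];
      positions[i] = position;
    }
    return positions;
  }

  public static void main(String[] args) {
    String[] tokens = {"桃", "桃太郎", "太郎", "電鉄"};
    int[] positions = toPositions(new int[] {1, 0, 1, 1});
    for (int i = 0; i < tokens.length; i++) {
      System.out.println(positions[i] + "\t" + tokens[i]);
    }
  }
}
```

Under this reading, `桃` and `桃太郎` both sit at position 0, followed by `太郎` at position 1 and `電鉄` at position 2.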
   
   #### Built unidic-cwj-3.1.1-full
   ```
   assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さ", "ん", "じ"}, new int[] {1, 1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
   assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
   assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真斗"}, new int[] {1, 1, 1});
   ```
   
   #### Built unidic-mecab-2.1.2_src
   ```
   assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});
   ```
   
   #### Built mecab-ipadic-2.7.0-20070801
   ```
   assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
   assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
   assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

