[ 
https://issues.apache.org/jira/browse/LUCENE-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504211#comment-17504211
 ] 

Tomoko Uchida commented on LUCENE-10393:
----------------------------------------

Both Kuromoji and Nori have `BinaryDictionary` and `BinaryDictionaryWriter` 
classes, and there is significant code duplication. This PR unifies them by 
decoupling language-specific information (or morphological information) from 
the base dictionary interface.

[https://github.com/apache/lucene/pull/740]

The PR is fairly large (to keep it self-contained), but in a nutshell, 
there are two conceptual interfaces:
 - Dictionary: a high-level interface parameterized by a specific 
MorphAttributes
 - MorphAttributes: a high-level interface that represents morphological 
information. This is supposed to be extended to hold language-specific details.

and two base classes that hold the logic common to Kuromoji and Nori:
 - BinaryDictionary: abstract base class for the dictionary lookup operation
 - BinaryDictionaryWriter: abstract base class for writing dictionary files
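
As a rough illustration of how these pieces could fit together, here is a 
minimal sketch. It is not the code in the PR: the method names 
(`getConnectionId`, `getWordCost`, `getMorphAttributes`, `cheapestWordId`), 
the `Toy*` classes, and the in-memory arrays standing in for binary 
dictionary files are all assumptions made for the example.

```java
// Hypothetical demo entry point; not part of the PR.
class MorphSketchDemo {
  public static void main(String[] args) {
    ToyAttributes attrs =
        new ToyAttributes(new int[] {5, 7, 9}, new int[] {100, 250, 40});
    ToyDictionary dict = new ToyDictionary(attrs);
    System.out.println(dict.cheapestWordId(0, 1, 2)); // prints 2
  }
}

/** Morphological information every language shares (method names are assumed). */
interface MorphAttributes {
  int getConnectionId(int wordId);

  int getWordCost(int wordId);
}

/** High-level dictionary, parameterized by the attributes type it exposes. */
interface Dictionary<T extends MorphAttributes> {
  T getMorphAttributes();
}

/** Shared logic lives here; a real version would read binary files, not arrays. */
abstract class BinaryDictionary<T extends MorphAttributes> implements Dictionary<T> {
  /** Written once against the base interface, reused by every language. */
  public final int cheapestWordId(int... candidateIds) {
    if (candidateIds.length == 0) {
      throw new IllegalArgumentException("no candidates");
    }
    T attrs = getMorphAttributes();
    int best = candidateIds[0];
    for (int id : candidateIds) {
      if (attrs.getWordCost(id) < attrs.getWordCost(best)) {
        best = id;
      }
    }
    return best;
  }
}

/** Toy attributes backed by plain arrays indexed by word id. */
final class ToyAttributes implements MorphAttributes {
  private final int[] connectionIds;
  private final int[] costs;

  ToyAttributes(int[] connectionIds, int[] costs) {
    this.connectionIds = connectionIds;
    this.costs = costs;
  }

  @Override
  public int getConnectionId(int wordId) {
    return connectionIds[wordId];
  }

  @Override
  public int getWordCost(int wordId) {
    return costs[wordId];
  }
}

/** Concrete dictionary bound to ToyAttributes. */
final class ToyDictionary extends BinaryDictionary<ToyAttributes> {
  private final ToyAttributes attrs;

  ToyDictionary(ToyAttributes attrs) {
    this.attrs = attrs;
  }

  @Override
  public ToyAttributes getMorphAttributes() {
    return attrs;
  }
}
```

The point of the type parameter is that shared code such as 
`cheapestWordId` only needs the base interface, while callers of a concrete 
dictionary still get the specific attributes type back without casting.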

These classes reside in the analyzers-common module; I added an 
`org.apache.lucene.analysis.morph` package to it.

Then, each concrete dictionary class can be rewritten by building on the above 
interfaces and base classes. For example,
 - Kuromoji's `TokenInfoDictionary` is a `BinaryDictionary` that is bound to 
`TokenInfoMorphAttributes` (an instance of `JaMorphAttributes`).
 - Nori's `UnknownDictionary` is a `BinaryDictionary` that is bound to 
`UnknownMorphAttributes` (an instance of `KoMorphAttributes`).
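
To make the binding concrete, the two examples could be sketched as below. 
The type names are taken from this comment, but everything else is assumed 
for illustration: that `JaMorphAttributes` and `KoMorphAttributes` are 
interfaces, the methods `getReading` and `getPosTag`, the constructors, and 
the toy data. The real classes in the PR will differ.

```java
// Hypothetical demo entry point; not part of the PR.
class LanguageBindingDemo {
  public static void main(String[] args) {
    TokenInfoDictionary ja =
        new TokenInfoDictionary(
            new TokenInfoMorphAttributes(new int[] {120}, new String[] {"neko"}));
    UnknownDictionary ko =
        new UnknownDictionary(
            new UnknownMorphAttributes(new int[] {900}, new String[] {"NNG"}));
    // Each dictionary hands back its own language-specific attributes type.
    System.out.println(ja.getMorphAttributes().getReading(0));
    System.out.println(ko.getMorphAttributes().getPosTag(0));
  }
}

/** Language-independent part (assumed to expose at least a word cost). */
interface MorphAttributes {
  int getWordCost(int wordId);
}

/** Japanese-specific extension; getReading is an assumed example method. */
interface JaMorphAttributes extends MorphAttributes {
  String getReading(int wordId);
}

/** Korean-specific extension; getPosTag is an assumed example method. */
interface KoMorphAttributes extends MorphAttributes {
  String getPosTag(int wordId);
}

/** Shared base class, bound to an attributes type by each subclass. */
abstract class BinaryDictionary<T extends MorphAttributes> {
  private final T attrs;

  protected BinaryDictionary(T attrs) {
    this.attrs = attrs;
  }

  public final T getMorphAttributes() {
    return attrs;
  }

  /** Common lookup that needs no language-specific knowledge. */
  public final int getWordCost(int wordId) {
    return attrs.getWordCost(wordId);
  }
}

final class TokenInfoMorphAttributes implements JaMorphAttributes {
  private final int[] costs;
  private final String[] readings;

  TokenInfoMorphAttributes(int[] costs, String[] readings) {
    this.costs = costs;
    this.readings = readings;
  }

  @Override
  public int getWordCost(int wordId) {
    return costs[wordId];
  }

  @Override
  public String getReading(int wordId) {
    return readings[wordId];
  }
}

final class UnknownMorphAttributes implements KoMorphAttributes {
  private final int[] costs;
  private final String[] posTags;

  UnknownMorphAttributes(int[] costs, String[] posTags) {
    this.costs = costs;
    this.posTags = posTags;
  }

  @Override
  public int getWordCost(int wordId) {
    return costs[wordId];
  }

  @Override
  public String getPosTag(int wordId) {
    return posTags[wordId];
  }
}

/** Kuromoji-side dictionary: a BinaryDictionary bound to Japanese attributes. */
final class TokenInfoDictionary extends BinaryDictionary<TokenInfoMorphAttributes> {
  TokenInfoDictionary(TokenInfoMorphAttributes attrs) {
    super(attrs);
  }
}

/** Nori-side dictionary: a BinaryDictionary bound to Korean attributes. */
final class UnknownDictionary extends BinaryDictionary<UnknownMorphAttributes> {
  UnknownDictionary(UnknownMorphAttributes attrs) {
    super(attrs);
  }
}
```

In this sketch the concrete dictionary classes contain almost no code of 
their own, which is the kind of duplication reduction the PR is after.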

The main points of the PR are reducing code duplication and sorting out the 
interfaces. While Kuromoji and Nori have evolved independently so far, 
they are still conceptually the same, and I think re-unifying them at some 
level may be good for future development and bug fixes.

> Should we unify the dictionary builder/loader of kuromoji and nori?
> -------------------------------------------------------------------
>
>                 Key: LUCENE-10393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10393
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> A spin-off from LUCENE-8816.
> Kuromoji and Nori have a lot of duplicated code in their dictionary 
> builder/loader and we occasionally have to maintain both of them; I'd like to 
> explore the possibility of their unification at some level.


