date:20220310

[jira] [Commented] (LUCENE-10393) Should we unify the dictionary builder/loader of kuromoji and nori?

2022-03-10 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504211#comment-17504211
 ] 

Tomoko Uchida commented on LUCENE-10393:


Both Kuromoji and Nori have `BinaryDictionary` and `BinaryDictionaryWriter` 
classes, and there is significant code duplication. This PR unifies them by 
decoupling language-specific information (or morphological information) from 
the base dictionary interface.

[https://github.com/apache/lucene/pull/740]

This is fairly large (in order to make it self-contained) but in a nutshell, 
there are two conceptual interfaces:
 - Dictionary: a high-level interface parameterized by a specific 
MorphAttributes
 - MorphAttributes: a high-level interface that represents morphological 
information. This is supposed to be extended to hold language-specific details.

and base classes that have common logic in kuromoji and nori:
 - BinaryDictionary: abstract base class for the dictionary lookup operation
 - BinaryDictionaryWriter: abstract base class for writing dictionary files

Those classes reside in the analyzers-common module; I added 
`org.apache.lucene.analysis.morph` package to it.

Then, each concrete dictionary class can be rewritten by extending the above 
base classes and interfaces. For example,
 - Kuromoji's `TokenInfoDictionary` is a `BinaryDictionary` that is bound to 
`TokenInfoMorphAttributes` (an instance of `JaMorphAttributes`).
 - Nori's `UnknownDictionary` is a `BinaryDictionary` that is bound to 
`UnknownMorphAttributes` (an instance of `KoMorphAttributes`).
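
To make the shape of this design concrete, here is a minimal sketch of how a 
dictionary can be parameterized by its morphological attributes. It is only an 
illustration: the method names (`getWordCost`, `getReading`, 
`getMorphAttributes`) and the `Toy*` classes are assumptions made for the 
example and are not taken from the pull request.

```java
// Illustrative sketch only: every method and the Toy* classes are hypothetical,
// not the code in the pull request.
final class MorphSketch {

  /** Language-independent morphological information (hypothetical method). */
  interface MorphAttributes {
    int getWordCost(int wordId);
  }

  /** Japanese-specific details layered on top (hypothetical method). */
  interface JaMorphAttributes extends MorphAttributes {
    String getReading(int wordId);
  }

  /** A dictionary parameterized by the morphological data it exposes. */
  interface Dictionary<T extends MorphAttributes> {
    T getMorphAttributes();
  }

  /** Shared logic would live here; only the attribute type varies per language. */
  abstract static class BinaryDictionary<T extends MorphAttributes> implements Dictionary<T> {
    private final T morphAttributes;

    protected BinaryDictionary(T morphAttributes) {
      this.morphAttributes = morphAttributes;
    }

    @Override
    public T getMorphAttributes() {
      return morphAttributes;
    }
  }

  /** Toy Japanese attributes backed by in-memory arrays. */
  static final class ToyJaMorphAttributes implements JaMorphAttributes {
    private final int[] costs = {100, 250};
    private final String[] readings = {"nihon", "go"};

    @Override
    public int getWordCost(int wordId) {
      return costs[wordId];
    }

    @Override
    public String getReading(int wordId) {
      return readings[wordId];
    }
  }

  /** Concrete dictionary bound to the Japanese attribute type. */
  static final class ToyJaDictionary extends BinaryDictionary<ToyJaMorphAttributes> {
    ToyJaDictionary() {
      super(new ToyJaMorphAttributes());
    }
  }

  public static void main(String[] args) {
    ToyJaDictionary dict = new ToyJaDictionary();
    // Callers see the language-specific type without casting.
    System.out.println(dict.getMorphAttributes().getReading(0));
  }
}
```

With this arrangement the base class never needs to know about readings or 
other language-specific fields; a Korean dictionary would bind the same base 
class to its own attribute type.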

The main points of the PR are reducing code duplication and sorting out the 
interfaces. While Kuromoji and Nori have evolved independently so far, 
they are still conceptually the same, and I think re-unifying them at some 
level may be good for future development and bug fixes.

> Should we unify the dictionary builder/loader of kuromoji and nori?
> ---
>
> Key: LUCENE-10393
> URL: https://issues.apache.org/jira/browse/LUCENE-10393
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> A spin-off from LUCENE-8816.
> Kuromoji and Nori have a lot of duplicated code in their dictionary 
> builder/loader and we occasionally have to maintain both of them; I'd like to 
> explore the possibility of their unification at some level.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #740: LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori

2022-03-10 Thread GitBox



rmuir commented on pull request #740:
URL: https://github.com/apache/lucene/pull/740#issuecomment-1064044898


   I have only looked at the high-level design so far, but this seems to be a 
good approach @mocobeta ! Thank you for tackling it. I think the bottom-up 
approach is a good one, and splitting out the morphological data into a 
separate interface makes sense to me. 
   
   I would suggest reconsidering the name `MorphAttributes`, mostly because 
"Attributes" already has a complex meaning within Lucene analysis. Some 
possibilities (not an exhaustive list):
   * `MorphData`
   * `DictionaryData`
   
   I will do more review and testing; I am digging into it in detail.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [lucene] kkewwei opened a new pull request #741: LUCENE-9998: avoid the instant writing rate bigger than the limited rate in merge process

2022-03-10 Thread GitBox

kkewwei opened a new pull request #741:
URL: https://github.com/apache/lucene/pull/741

# Description

In the merge write process, if there is a long interval between two chunk
writes, the second chunk write is not paused. As a result, the instantaneous
write rate of the second chunk is high and can far exceed the configured
rate limit.
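
The effect can be illustrated with a simplified model of a pause calculation
that budgets time from the timestamp of the previous write. This is a
sketch under assumptions, not Lucene's actual rate limiter: the class name
`RatePauseSketch`, the method `pauseNanos`, and its parameters are
hypothetical.

```java
// Simplified, hypothetical model of a timestamp-based pause calculation.
final class RatePauseSketch {

  /**
   * Returns how long a writer should pause before writing {@code bytes},
   * given the time of the previous write and the current time.
   */
  static long pauseNanos(long bytes, double mbPerSec, long lastNanos, long nowNanos) {
    // Time this chunk is "allowed" to take at the configured rate.
    double secondsToWrite = (bytes / 1024.0 / 1024.0) / mbPerSec;
    long targetNanos = lastNanos + (long) (secondsToWrite * 1_000_000_000L);
    // If the idle gap already exceeds the budget, no pause is applied at all.
    return Math.max(0L, targetNanos - nowNanos);
  }

  public static void main(String[] args) {
    long oneMb = 1024L * 1024L;
    // Back-to-back writes: the chunk is throttled.
    System.out.println(pauseNanos(oneMb, 1.0, 0L, 0L));
    // Five seconds idle before the write: the chunk goes out unthrottled.
    System.out.println(pauseNanos(oneMb, 1.0, 0L, 5_000_000_000L));
  }
}
```

In this model the idle gap is credited against the next chunk, so a chunk
written after a long gap is never slowed down, however large it is.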

# Tests

Added TestRateLimitedIndexOutput.

# Checklist

Please review the following and check all that apply:

- [ ] I have reviewed the guidelines for [How to
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code
conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request
title.
- [ ] I have given Lucene maintainers
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
