[ https://issues.apache.org/jira/browse/LUCENE-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uihyun Kim updated LUCENE-10049:
--------------------------------
    Attachment: LUCENE-10049.patch

> part of speech tagging for Korean, Japanese
> -------------------------------------------
>
>                 Key: LUCENE-10049
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10049
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Uihyun Kim
>            Priority: Trivial
>              Labels: newbie
>         Attachments: LUCENE-10049.patch
>
>
> The Korean (nori) and Japanese (kuromoji) analyzers behave the same way: both use a dictionary-based, finite-state approach to identify words (a.k.a. tokens).
> When analyzing Korean or Japanese input, the tokenizer needs to perform a dictionary lookup at every character position in order to build the lattice of all possible segmentations. To make this efficient, the full vocabulary is encoded in an FST (finite state transducer), so text can be analyzed with the Viterbi algorithm to find the most likely segmentation (called the Viterbi path) of any Korean or Japanese input.
>
> {code:java}
> org.apache.lucene.analysis.ko.GraphvizFormatter
> org.apache.lucene.analysis.ja.GraphvizFormatter
> {code}
>
> These two classes already produce Graphviz output to visualize the Viterbi lattice built from input text (usage sketches follow at the end of this description). However, in my experience, part-of-speech information is essential for diagnosing why the output looks the way it does, since segmentation is driven by the dictionary.
> Adding each token's part of speech to the visualization will help users understand the analyzers. Some other users and I use these classes in our Lucene-related projects, even though it's a very trivial part. I will open a PR after issue review.
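> For illustration, a minimal sketch (not part of the attached patch) of how a token's part of speech can already be read programmatically from nori; the class name NoriPosDemo and the sample sentence are hypothetical:
>
> {code:java}
> import java.io.StringReader;
>
> import org.apache.lucene.analysis.ko.KoreanTokenizer;
> import org.apache.lucene.analysis.ko.tokenattributes.PartOfSpeechAttribute;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> public class NoriPosDemo {
>   public static void main(String[] args) throws Exception {
>     // The no-arg constructor uses the bundled mecab-ko-dic system dictionary.
>     try (KoreanTokenizer tokenizer = new KoreanTokenizer()) {
>       tokenizer.setReader(new StringReader("나는 책을 읽는다")); // hypothetical input
>       CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
>       PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);
>       tokenizer.reset();
>       while (tokenizer.incrementToken()) {
>         // left and right POS differ only for compound or inflected tokens
>         System.out.println(term + " -> " + pos.getLeftPOS());
>       }
>       tokenizer.end();
>     }
>   }
> }
> {code}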
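>
> And a minimal sketch of how GraphvizFormatter is wired into a tokenizer today (shown for kuromoji; nori is analogous); the class name LatticeDotDemo and the input string are hypothetical:
>
> {code:java}
> import java.io.StringReader;
>
> import org.apache.lucene.analysis.ja.GraphvizFormatter;
> import org.apache.lucene.analysis.ja.JapaneseTokenizer;
> import org.apache.lucene.analysis.ja.dict.ConnectionCosts;
>
> public class LatticeDotDemo {
>   public static void main(String[] args) throws Exception {
>     GraphvizFormatter gv = new GraphvizFormatter(ConnectionCosts.getInstance());
>     try (JapaneseTokenizer tokenizer =
>         new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)) {
>       tokenizer.setGraphvizFormatter(gv);
>       tokenizer.setReader(new StringReader("関西国際空港")); // hypothetical input
>       tokenizer.reset();
>       while (tokenizer.incrementToken()) {
>         // consume every token so the full Viterbi lattice is recorded
>       }
>       tokenizer.end();
>       // finish() returns the DOT source for the lattice; render it with Graphviz,
>       // e.g. `dot -Tpng lattice.dot -o lattice.png`.
>       System.out.println(gv.finish());
>     }
>   }
> }
> {code}
>
> The attached patch concerns adding this part-of-speech information to the lattice visualization produced here.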