Uihyun Kim created LUCENE-10049:
-----------------------------------

             Summary: part of speech tagging for Korean, Japanese
                 Key: LUCENE-10049
                 URL: https://issues.apache.org/jira/browse/LUCENE-10049
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
            Reporter: Uihyun Kim
         Attachments: LUCENE-10049.patch
The Korean (nori) and Japanese (kuromoji) analyzers behave the same way: both use a dictionary-based, finite-state approach to identify words (tokens). When analyzing Korean or Japanese input, the tokenizer performs a dictionary lookup at every character position in order to build a lattice of all possible segmentations. To do this efficiently, the full vocabulary is encoded in an FST (finite state transducer), and the Viterbi algorithm is then used to find the most likely segmentation (the Viterbi path) of any Korean or Japanese input.

{code:java}
org.apache.lucene.analysis.ko.GraphvizFormatter
org.apache.lucene.analysis.ja.GraphvizFormatter
{code}

These two classes already produce Graphviz output that visualizes the Viterbi lattice built from the input text. However, in my experience, part-of-speech information is essential for diagnosing why the output looks the way it does, precisely because the analysis is dictionary-driven. Adding each token's part of speech to the lattice visualization will help users understand the analyzers. Some users and I rely on these classes in Lucene-related projects, even though they are a small part of the codebase.

I will open a PR after the issue is reviewed.
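For reference, a minimal sketch of how the kuromoji formatter can be wired up today to dump the lattice (constructor signatures may vary slightly across Lucene versions, and the input string is just an example):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.GraphvizFormatter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.ConnectionCosts;

public class LatticeDump {
  public static void main(String[] args) throws Exception {
    // The formatter records the Viterbi lattice while the tokenizer runs.
    GraphvizFormatter gv = new GraphvizFormatter(ConnectionCosts.getInstance());
    try (JapaneseTokenizer tokenizer =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)) {
      tokenizer.setGraphvizFormatter(gv);
      tokenizer.setReader(new StringReader("日本経済新聞")); // example input
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        // Consuming the stream drives the Viterbi search; the formatter
        // captures lattice nodes and edges as a side effect.
      }
      tokenizer.end();
    }
    // finish() returns Graphviz "dot" source describing the lattice.
    System.out.println(gv.finish());
  }
}
{code}

The nori tokenizer (org.apache.lucene.analysis.ko.KoreanTokenizer) exposes the same setGraphvizFormatter hook, so the proposed part-of-speech labels would surface in both visualizations.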