[ https://issues.apache.org/jira/browse/LUCENE-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uihyun Kim updated LUCENE-10049:
--------------------------------
    Attachment: LUCENE-10049.patch

> part of speech tagging for Korean, Japanese
> -------------------------------------------
>
>                 Key: LUCENE-10049
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10049
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Uihyun Kim
>            Priority: Trivial
>              Labels: newbie
>         Attachments: LUCENE-10049.patch
>
>
> The Korean (nori) and Japanese (kuromoji) analyzers behave the same way: both use a dictionary-based, finite-state approach to identify words (a.k.a. tokens).
> When analyzing Korean or Japanese input, the tokenizer needs to perform a dictionary lookup at every character position in order to build the lattice of all possible segmentations. To make this efficient, the full vocabulary is encoded in an FST (finite state transducer), so text can be analyzed with the Viterbi algorithm to find the most likely segmentation (called the Viterbi path) of any Korean or Japanese input.
>
> {code:java}
> org.apache.lucene.analysis.ko.GraphvizFormatter
> org.apache.lucene.analysis.ja.GraphvizFormatter
> {code}
>
> These two classes already produce Graphviz output to visualize the Viterbi lattice built from input text (usage sketches follow at the end of this description). However, in my experience, part-of-speech information is essential for diagnosing why the output looks the way it does, since segmentation is driven by the dictionary.
> Adding each token's part of speech to the visualization will help users understand the analyzers. Some other users and I use these classes in our Lucene-related projects, even though it's a very trivial part. I will open a PR after issue review.
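> For illustration, a minimal sketch (not part of the attached patch) of how a token's part of speech can already be read programmatically from nori; the class name NoriPosDemo and the sample sentence are hypothetical:
>
> {code:java}
> import java.io.StringReader;
>
> import org.apache.lucene.analysis.ko.KoreanTokenizer;
> import org.apache.lucene.analysis.ko.tokenattributes.PartOfSpeechAttribute;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> public class NoriPosDemo {
>   public static void main(String[] args) throws Exception {
>     // The no-arg constructor uses the bundled mecab-ko-dic system dictionary.
>     try (KoreanTokenizer tokenizer = new KoreanTokenizer()) {
>       tokenizer.setReader(new StringReader("나는 책을 읽는다")); // hypothetical input
>       CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
>       PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);
>       tokenizer.reset();
>       while (tokenizer.incrementToken()) {
>         // left and right POS differ only for compound or inflected tokens
>         System.out.println(term + " -> " + pos.getLeftPOS());
>       }
>       tokenizer.end();
>     }
>   }
> }
> {code}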
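>
> And a minimal sketch of how GraphvizFormatter is wired into a tokenizer today (shown for kuromoji; nori is analogous); the class name LatticeDotDemo and the input string are hypothetical:
>
> {code:java}
> import java.io.StringReader;
>
> import org.apache.lucene.analysis.ja.GraphvizFormatter;
> import org.apache.lucene.analysis.ja.JapaneseTokenizer;
> import org.apache.lucene.analysis.ja.dict.ConnectionCosts;
>
> public class LatticeDotDemo {
>   public static void main(String[] args) throws Exception {
>     GraphvizFormatter gv = new GraphvizFormatter(ConnectionCosts.getInstance());
>     try (JapaneseTokenizer tokenizer =
>         new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)) {
>       tokenizer.setGraphvizFormatter(gv);
>       tokenizer.setReader(new StringReader("関西国際空港")); // hypothetical input
>       tokenizer.reset();
>       while (tokenizer.incrementToken()) {
>         // consume every token so the full Viterbi lattice is recorded
>       }
>       tokenizer.end();
>       // finish() returns the DOT source for the lattice; render it with Graphviz,
>       // e.g. `dot -Tpng lattice.dot -o lattice.png`.
>       System.out.println(gv.finish());
>     }
>   }
> }
> {code}
>
> The attached patch concerns adding this part-of-speech information to the lattice visualization produced here.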