twosom opened a new issue, #14940:
URL: https://github.com/apache/lucene/issues/14940

   ### Description
   
   ## Overview
   This issue proposes adding metadata support to the Nori Korean analyzer, 
allowing users to attach additional information to dictionary words that can be 
accessed during text analysis.
   
   ## Background and Motivation
   Currently, the Nori analyzer allows users to register words in a custom 
dictionary, but there's no way to associate additional information with these 
words. By supporting metadata, we can enable:
   - Attaching semantic category information to words (e.g., "Java" -> 
"programming language")
   - Preserving that information on the parts of a decomposed compound word
   - Custom tagging and classification
   - Domain-specific annotations
   
   ## Proposed Implementation
   1. Add metadata support to the `Token` class
   2. Create `MetadataAttribute` and its implementation
   3. Extend user dictionary format with a metadata separator (`>>`)
   4. Preserve metadata during compound word decomposition
   
   ## Usage Example
   User dictionary:
   ```
   자바 >> computer language 
   java >> computer language 
   엘라스틱서치 엘라스틱 서치 >> search engine
   ```
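The extended line format above could be parsed roughly as follows. This is only a sketch in plain Java with no Lucene dependency: `UserDictEntry` and its fields are hypothetical names, not existing Nori classes, and it assumes that everything after the first `>>` is metadata and that the part before it keeps the existing whitespace-separated "surface segment..." layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for one parsed user-dictionary line; not a Lucene class. */
final class UserDictEntry {
  static final String METADATA_SEPARATOR = ">>";

  final String surface;        // the dictionary word itself
  final List<String> segments; // optional decomposition; empty for simple words
  final String metadata;       // text after ">>", or null when the line has none

  private UserDictEntry(String surface, List<String> segments, String metadata) {
    this.surface = surface;
    this.segments = segments;
    this.metadata = metadata;
  }

  /** Splits a line into surface form, decomposition segments, and optional metadata. */
  static UserDictEntry parse(String line) {
    String body = line;
    String metadata = null;
    int sep = line.indexOf(METADATA_SEPARATOR);
    if (sep >= 0) {
      // Assumption: the first ">>" ends the word part; the rest is free-form metadata.
      body = line.substring(0, sep);
      String meta = line.substring(sep + METADATA_SEPARATOR.length()).trim();
      metadata = meta.isEmpty() ? null : meta;
    }
    String[] parts = body.trim().split("\\s+");
    if (parts[0].isEmpty()) {
      throw new IllegalArgumentException("missing surface form: " + line);
    }
    List<String> segments = new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
    return new UserDictEntry(parts[0], Collections.unmodifiableList(segments), metadata);
  }
}
```

A line without `>>` parses exactly as before, with `metadata` left as `null`, which is one way the format could stay backward compatible with existing dictionaries.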
   
   Expected analysis output:
   
   Input: 자바
   ```
   Output:
   Term: 자바
   Metadata: computer language
   POS: NNG
   ---
   ```
   
   Input: 엘라스틱서치
   ```
   Output:
   Term: 엘라스틱서치
   Metadata: search engine
   Position Increment: 1
   Position Length: 2
   ---
   Term: 엘라스틱
   Metadata: search engine
   Position Increment: 0
   Position Length: 1
   ---
   Term: 서치
   Metadata: search engine
   Position Increment: 1
   Position Length: 1
   ---
   ```
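The compound example above keeps the full word and also emits its parts, each carrying the same metadata, similar to what Nori's `MIXED` decompound mode produces for positions. A minimal plain-Java sketch of that propagation rule is shown here. `MetaToken` and `CompoundExpander` are hypothetical stand-ins, not Lucene's `Token` or tokenizer classes, and the sketch assumes an entry with fewer than two segments is emitted as a single token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an analyzed token; not Lucene's Token class. */
final class MetaToken {
  final String term;
  final String metadata;
  final int positionIncrement;
  final int positionLength;

  MetaToken(String term, String metadata, int positionIncrement, int positionLength) {
    this.term = term;
    this.metadata = metadata;
    this.positionIncrement = positionIncrement;
    this.positionLength = positionLength;
  }
}

/** Hypothetical helper that copies dictionary metadata onto decompounded tokens. */
final class CompoundExpander {
  /** Emits the compound first, then each part, all sharing the entry's metadata. */
  static List<MetaToken> expand(String surface, List<String> segments, String metadata) {
    List<MetaToken> out = new ArrayList<>();
    if (segments.size() < 2) {
      // Assumption: no real decomposition, so emit the word as a single token.
      out.add(new MetaToken(surface, metadata, 1, 1));
      return out;
    }
    // The compound spans as many positions as it has parts.
    out.add(new MetaToken(surface, metadata, 1, segments.size()));
    for (int i = 0; i < segments.size(); i++) {
      // The first part starts at the compound's position; later parts advance by one.
      int positionIncrement = (i == 0) ? 0 : 1;
      out.add(new MetaToken(segments.get(i), metadata, positionIncrement, 1));
    }
    return out;
  }
}
```

Calling `CompoundExpander.expand` with the compound entry from the user dictionary and its two segments yields three tokens whose position increments and lengths match the listing above.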
   
   
   
   ## Benefits
   1. Enhanced information modeling: Attach additional information to words to 
improve search quality
   2. Domain-specific analysis: Define metadata relevant to specific domains
   3. Custom dictionary extension: Add capabilities while maintaining backward 
compatibility


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

