[GitHub] [lucene] maomao905 commented on issue #11976: End offset for compatibility characters is not incremented with ICUNormalizer2CharFilter

GitBox Fri, 25 Nov 2022 18:49:20 -0800


maomao905 commented on issue #11976:
URL: https://github.com/apache/lucene/issues/11976#issuecomment-1327966458


   OK, I tried using the token filter instead.
   The compatibility character (`㋀`) is not tokenized by `icu_tokenizer` and is 
excluded from the result, so the `icu_normalizer` token filter never sees it.
   It seems that compatibility characters need to be normalized before 
the text is tokenized.
   
   <details>
   <summary>Elasticsearch request</summary>
   <div>
   
   ```sh
   $ curl "http://${ES_URL}/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d '{"tokenizer" : "icu_tokenizer", "char_filter" : [], "filter": ["icu_normalizer"], "text" : "日日㋀日", "explain": true}'
   {
     "detail" : {
       "custom_analyzer" : true,
       "charfilters" : [ ],
       "tokenizer" : {
         "name" : "icu_tokenizer",
         "tokens" : [
           {
             "token" : "日",
             "start_offset" : 0,
             "end_offset" : 1,
             "type" : "<IDEOGRAPHIC>",
             "position" : 0,
             "bytes" : "[e6 97 a5]",
             "positionLength" : 1,
             "script" : "Chinese/Japanese",
             "termFrequency" : 1
           },
           {
             "token" : "日",
             "start_offset" : 1,
             "end_offset" : 2,
             "type" : "<IDEOGRAPHIC>",
             "position" : 1,
             "bytes" : "[e6 97 a5]",
             "positionLength" : 1,
             "script" : "Chinese/Japanese",
             "termFrequency" : 1
           },
           {
             "token" : "日",
             "start_offset" : 3,
             "end_offset" : 4,
             "type" : "<IDEOGRAPHIC>",
             "position" : 2,
             "bytes" : "[e6 97 a5]",
             "positionLength" : 1,
             "script" : "Chinese/Japanese",
             "termFrequency" : 1
           }
         ]
       },
       "tokenfilters" : [
         {
           "name" : "icu_normalizer",
           "tokens" : [
             {
               "token" : "日",
               "start_offset" : 0,
               "end_offset" : 1,
               "type" : "<IDEOGRAPHIC>",
               "position" : 0,
               "bytes" : "[e6 97 a5]",
               "positionLength" : 1,
               "script" : "Chinese/Japanese",
               "termFrequency" : 1
             },
             {
               "token" : "日",
               "start_offset" : 1,
               "end_offset" : 2,
               "type" : "<IDEOGRAPHIC>",
               "position" : 1,
               "bytes" : "[e6 97 a5]",
               "positionLength" : 1,
               "script" : "Chinese/Japanese",
               "termFrequency" : 1
             },
             {
               "token" : "日",
               "start_offset" : 3,
               "end_offset" : 4,
               "type" : "<IDEOGRAPHIC>",
               "position" : 2,
               "bytes" : "[e6 97 a5]",
               "positionLength" : 1,
               "script" : "Chinese/Japanese",
               "termFrequency" : 1
             }
           ]
         }
       ]
     }
   }
   ```
   </div>
   </details>
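
The offsets in the response above (0-1, 1-2, 3-4) leave position 2 uncovered, which is where `㋀` sits. As a rough illustration of why normalizing before tokenizing changes the picture, here is a minimal sketch using Python's standard `unicodedata` module. It is only a stand-in: it applies plain NFKC, not the `icu_normalizer` implementation, and the helper `uncovered_positions` is a hypothetical name introduced for this example.

```python
import unicodedata

text = "日日㋀日"

# (start_offset, end_offset) pairs copied from the icu_tokenizer output above.
token_offsets = [(0, 1), (1, 2), (3, 4)]


def uncovered_positions(length, offsets):
    """Return the character positions that no token span covers."""
    covered = set()
    for start, end in offsets:
        covered.update(range(start, end))
    return [i for i in range(length) if i not in covered]


# Position 2 is the only gap, and it holds the compatibility character.
missing = uncovered_positions(len(text), token_offsets)
print(missing, [text[i] for i in missing])

# NFKC (used here as a stand-in for the ICU normalizer) expands the single
# compatibility character into two ordinary characters, so the normalized
# text is one character longer than the original.
normalized = unicodedata.normalize("NFKC", text)
print(normalized, len(text), len(normalized))
```

Because the normalized text is longer than the input, any component that normalizes before tokenizing has to map offsets in the normalized text back to offsets in the original, which is the part this issue is about.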


