rmuir commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-806079907


   > I'm curious what you'll make of [0d8c001](https://github.com/apache/lucene/commit/0d8c001d88bda373fe321550a452c4dd53a3af74) ... the previous state was kind of weird because we were ostensibly "detecting" norm ids that never actually cropped up in practice, but then throwing an `UnsupportedOperationException` if we had ever come to the point of trying to replace them. This worked because of the fact that they never cropped up in practice. I'm pretty sure that the change introduced in [0d8c001](https://github.com/apache/lucene/commit/0d8c001d88bda373fe321550a452c4dd53a3af74) would work fine, but at the moment it's definitely not covered by tests.
   > 
   > Alternatives to [0d8c001](https://github.com/apache/lucene/commit/0d8c001d88bda373fe321550a452c4dd53a3af74) would be:
   > 
   >     1. stop detecting the strings FCC, FCD, and NFKC_CF (i.e. don't recognize them as candidates for replacement/optimization)
   
   Either the current code or option 1 is fine. Honestly, no one will ever use these.
   
   Users use normalization in their rules because they want "one" rule to capture the transformation, e.g.:
   ```
   # alef w/ hamza below
   \u0625 > i;
   ```
   
   * they don't want to write duplicate rules to handle the decomposed case (e.g. 0627 + 0655)
   * they don't want an explosion of rules to handle diacritic ordering (NFC/NFD enforce an order by combining class); see the sketch below
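   
   To make this concrete, here's a minimal sketch using ICU4J's `Transliterator.createFromRules` (the class name and transliterator ID are mine, not from this PR): with an `:: NFC;` step up front, the single composed-form rule also matches decomposed input.
   
   ```java
   import com.ibm.icu.text.Transliterator;
   
   public class HamzaRuleDemo {
       public static void main(String[] args) {
           // ":: NFC;" composes the input first, so decomposed
           // U+0627 + U+0655 becomes U+0625 and the one rule matches both.
           String rules =
               ":: NFC;\n" +
               "\\u0625 > i;\n";  // alef w/ hamza below
           Transliterator t =
               Transliterator.createFromRules("hamza-demo", rules, Transliterator.FORWARD);
           System.out.println(t.transliterate("\u0625"));        // composed input: "i"
           System.out.println(t.transliterate("\u0627\u0655"));  // decomposed input: "i"
       }
   }
   ```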
   
   So in most cases, NFC or NFD is useful. Whether a person picks NFC or NFD depends more on what the standard/rule system is supposed to do and how the writing system works, or in some cases may just be arbitrary. In the case of Korean, if we use NFD and work on Jamo, it takes only a tiny number of rules (we work on characters, like an alphabet). But if we use NFC we would need something like 11,000 rules, one for each syllable.
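   
   A quick illustration of the NFD route (a sketch using ICU4J's `Normalizer2`; the class name is mine):
   
   ```java
   import com.ibm.icu.text.Normalizer2;
   
   public class JamoDemo {
       public static void main(String[] args) {
           // Under NFD the syllable U+D55C decomposes into the jamo
           // U+1112 + U+1161 + U+11AB, so rules can target individual
           // letters instead of needing one rule per precomposed syllable.
           Normalizer2 nfd = Normalizer2.getNFDInstance();
           System.out.println(nfd.normalize("\uD55C").length());  // 3
       }
   }
   ```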
   
   For some writing systems, there may be legacy compatibility characters, designed just for round-tripping back to old charsets. In our Arabic example here, these exist, and you might see them if you extract text from a PDF (e.g. FE87, FE88). In those cases, NFKC or NFKD is a better choice.
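   
   For instance (again a sketch with `Normalizer2`; FE87 is the isolated presentation form of the alef-with-hamza-below from the example above):
   
   ```java
   import com.ibm.icu.text.Normalizer2;
   
   public class PresentationFormDemo {
       public static void main(String[] args) {
           // NFKC folds the compatibility character U+FE87 to the plain
           // letter U+0625, so a rule on U+0625 also covers text extracted
           // from PDFs that contains presentation forms.
           Normalizer2 nfkc = Normalizer2.getNFKCInstance();
           System.out.println(nfkc.normalize("\uFE87").equals("\u0625"));  // true
       }
   }
   ```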
   
   But nobody will need "fast C&D" (FCC/FCD; this is for collation, I think?) or NFKC_CF here (usually if there is capitalization, you tend to preserve it in the rules).
   

