rmuir commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-806079907


   > I'm curious what you'll make of [0d8c001](https://github.com/apache/lucene/commit/0d8c001d88bda373fe321550a452c4dd53a3af74) ... the previous state was kind of weird because we were ostensibly "detecting" norm ids that never actually cropped up in practice, but then throwing an `UnsupportedOperationException` if we had ever come to the point of trying to replace them. This worked because of the fact that they never cropped up in practice. I'm pretty sure that the change introduced in [0d8c001](https://github.com/apache/lucene/commit/0d8c001d88bda373fe321550a452c4dd53a3af74) would work fine, but at the moment it's definitely not covered by tests.
   > 
   > Alternatives to [0d8c001](https://github.com/apache/lucene/commit/0d8c001d88bda373fe321550a452c4dd53a3af74) would be:
   > 
   >     1. stop detecting the strings FCC, FCD, and NFKC_CF (i.e. don't recognize them as candidates for replacement/optimization)
   
   Either the current code or option 1 is fine. Honestly, no one will ever use these.
   
   Users use normalization in their rules because they want "one" rule to capture the transformation, e.g.:
   ```
   # alef w/ hamza below
   \u0625 > i;
   ```
   
   * they don't want to write duplicate rules to handle the decomposed case (e.g. 0627 + 0655)
   * they don't want an explosion of rules to handle diacritic ordering (NFC/NFD enforce an order by combining class); see the sketch below
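   
   To make this concrete, here's a minimal sketch using ICU4J's `Transliterator.createFromRules` (the class name and transliterator ID are mine, not from this PR): with an `:: NFC;` step up front, the single composed-form rule also matches decomposed input.
   
   ```java
   import com.ibm.icu.text.Transliterator;
   
   public class HamzaRuleDemo {
       public static void main(String[] args) {
           // ":: NFC;" composes the input first, so decomposed
           // U+0627 + U+0655 becomes U+0625 and the one rule matches both.
           String rules =
               ":: NFC;\n" +
               "\\u0625 > i;\n";  // alef w/ hamza below
           Transliterator t =
               Transliterator.createFromRules("hamza-demo", rules, Transliterator.FORWARD);
           System.out.println(t.transliterate("\u0625"));        // composed input: "i"
           System.out.println(t.transliterate("\u0627\u0655"));  // decomposed input: "i"
       }
   }
   ```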
   
   So in most cases, NFC or NFD is useful. Whether a person picks NFC or NFD depends more on what the standard/rule system is supposed to do and how the writing system works, or in some cases may just be arbitrary. In the case of Korean, if we use NFD and work on Jamo, it takes only a tiny number of rules (we work on characters, like an alphabet). But if we use NFC we would need something like 11,000 rules, one for each syllable.
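   
   A quick illustration of the NFD route (a sketch using ICU4J's `Normalizer2`; the class name is mine):
   
   ```java
   import com.ibm.icu.text.Normalizer2;
   
   public class JamoDemo {
       public static void main(String[] args) {
           // Under NFD the syllable U+D55C decomposes into the jamo
           // U+1112 + U+1161 + U+11AB, so rules can target individual
           // letters instead of needing one rule per precomposed syllable.
           Normalizer2 nfd = Normalizer2.getNFDInstance();
           System.out.println(nfd.normalize("\uD55C").length());  // 3
       }
   }
   ```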
   
   For some writing systems, there may be legacy compatibility characters, designed just for round-tripping back to old charsets. In our Arabic example here, these exist, and you might see them if you extract text from a PDF (e.g. FE87, FE88). In those cases, NFKC or NFKD is a better choice.
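   
   For instance (again a sketch with `Normalizer2`; FE87 is the isolated presentation form of the alef-with-hamza-below from the example above):
   
   ```java
   import com.ibm.icu.text.Normalizer2;
   
   public class PresentationFormDemo {
       public static void main(String[] args) {
           // NFKC folds the compatibility character U+FE87 to the plain
           // letter U+0625, so a rule on U+0625 also covers text extracted
           // from PDFs that contains presentation forms.
           Normalizer2 nfkc = Normalizer2.getNFKCInstance();
           System.out.println(nfkc.normalize("\uFE87").equals("\u0625"));  // true
       }
   }
   ```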
   
   But nobody will need "fast C&D" (FCC/FCD; this is for collation, I think?) or NFKC_CF here (usually if there is capitalization, you tend to preserve it in the rules).
   

