[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

GitBox Sun, 14 Mar 2021 23:22:35 -0700


rmuir commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-799147946

Thanks for updating the PR! It's late tonight, but I will look at it this
week. I think I haven't looked since the patch so I am sure there are some
"interesting" things that had to be done for any kind of good performance...
Sorry if I ask dumb questions you have already answered...!

I think it's good to have the charfilter, especially for the more efficient
transforms like `XYZ>Latin`... I think we should mainly optimize the
transliterator code for those types of transforms.

The JIRA issue discusses stuff like Traditional->Simplified as the use-case,
but I am not sure I would make major sacrifices to try to speed up
Traditional->Simplified. At least in the past this one was slow as a
transliterator, but looking at the rules, maybe it really shouldn't be a
transliterator at all:
https://github.com/unicode-org/cldr/blob/master/common/transforms/Simplified-Traditional.xml.
This file is like 4k lines of simple mappings (longest match) and out of
thousands, only 6 context-based rules are used (ScDigit: 16 chars in the set).
So as an alternative, we could process that file, expand those 6 rules
(resulting in 66 additional "lines"), and produce a text file suitable for
MappingCharFilter. It could be part of the ICU upgrade automation in the gradle
build so that it gets regenerated with new ICU versions.
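
A rough sketch of what that expansion step could look like, just to make the
idea concrete. Everything here is an assumption for illustration: the helper
names are made up, the rule ("A" becomes "B" only when followed by one of a
few digits) is a placeholder and not taken from the CLDR file, and the output
is the quoted `"source" => "target"` style of mapping file. It also assumes
the context characters have no mappings of their own, so folding them into
the match does not change the result.

```python
def expand_context_rule(source, target, context_chars):
    """Expand one context-based rule into plain (source, target) pairs.

    Assumes the rule means "map source to target only when followed by a
    character from context_chars", and that the context character itself
    has no mapping of its own, so it can be copied through unchanged.
    """
    return [(source + ch, target + ch) for ch in context_chars]


def escape(text):
    # Escape backslashes and double quotes so the value can sit in quotes.
    return text.replace("\\", "\\\\").replace('"', '\\"')


def format_mapping_line(source, target):
    # One line in the quoted "source" => "target" style (assumed format).
    return '"{}" => "{}"'.format(escape(source), escape(target))


# Hypothetical rule: "A" becomes "B" only when followed by a digit 0-3.
pairs = expand_context_rule("A", "B", "0123")
lines = [format_mapping_line(s, t) for s, t in pairs]
print("\n".join(lines))
```

Each context rule turns into one plain mapping per context character, so the
generated file only needs longest-match lookup and no rule engine.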

I can see why disabling normalization in the output might improve
performance here: why normalize twice? But it's a bit scary to provide such an
option that makes things "faster" but might result in crazy output and confuse
users or even make tokenizers behave worse, if they don't understand what it
means. Even then it's still problematic, as the ICUNormalizer2CharFilter has
buggy offset handling that the tests will find if you re-enable testRandom and
run it enough times: https://issues.apache.org/jira/browse/LUCENE-5595 . So I'm
not a fan of e.g. returning unexpected decomposed output from this thing right
now. Which transforms does this optimization affect, and by how much does it
improve their performance?
