rmuir commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-799147946


   Thanks for updating the PR! It's late tonight, but I will look at this later
this week. I don't think I have looked since the patch, so I am sure there are some
"interesting" things that had to be done for any kind of good performance...
Sorry if I ask dumb questions you have already answered...!
   
   I think it's good to have the charfilter, especially for the more efficient
transforms like `XYZ>Latin`... I think we should mainly optimize the
transliterator code for those types of transforms.
   
   The JIRA issue discusses stuff like Traditional->Simplified as the use case,
but I am not sure I would make major sacrifices to try to speed up
Traditional->Simplified. At least in the past this one was slow as a
transliterator, but looking at the rules, maybe it really shouldn't be a
transliterator at all:
https://github.com/unicode-org/cldr/blob/master/common/transforms/Simplified-Traditional.xml.
This file is like 4k lines of simple mappings (longest match), and out of
those thousands, only 6 context-based rules are used (ScDigit: 16 chars in the set).
So as an alternative, we could process that file, expand those 6 rules
(resulting in 66 additional "lines"), and produce a text file suitable for
MappingCharFilter. It could be part of the ICU upgrade automation in the gradle
build so that it gets regenerated with new ICU versions.
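
   To make the "expand those rules" idea concrete, here is a rough sketch of
what the expansion step might look like. It is not the actual CLDR processing:
the class and method names are made up, the sample characters are placeholders
rather than real rule data, and it assumes each context rule has the simple shape
"source, when followed by any char in a set, becomes target". The output lines
use the `"source" => "target"` form that MappingCharFilter mapping files use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: expands one context-based rule into explicit
// longest-match mapping lines. All names here are illustrative only.
public class ContextRuleExpander {

  // Assumed rule shape: `source` becomes `target` only when it is followed
  // by one of the code points in `contextSet`. One line is emitted per
  // context code point, with the context char copied to both sides.
  static List<String> expandAfterContext(String source, String target, String contextSet) {
    List<String> lines = new ArrayList<>();
    contextSet.codePoints().forEach(cp -> {
      String ctx = new String(Character.toChars(cp));
      lines.add(formatLine(source + ctx, target + ctx));
    });
    return lines;
  }

  // Formats one line as "source" => "target".
  static String formatLine(String source, String target) {
    return "\"" + escape(source) + "\" => \"" + escape(target) + "\"";
  }

  // Escapes backslashes and double quotes inside a quoted side of a mapping.
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  public static void main(String[] args) {
    // Placeholder data: "A" becomes "B" when followed by 1, 2, or 3.
    for (String line : expandAfterContext("A", "B", "123")) {
      System.out.println(line);
    }
  }
}
```

   One caveat with this approach: each expanded mapping also consumes the
context character, so a mapping that would otherwise start at that character no
longer gets a chance to match. Whether that matters depends on the actual rules.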
   
   I can see why disabling normalization in the output might improve
performance here: why normalize twice? But it's a bit scary to provide such an
option that makes things "faster" but might result in crazy output and confuse
users, or even make tokenizers behave worse, if they don't understand what it
means. Even then it's still problematic, as the ICUNormalizer2CharFilter has
buggy offset handling that the tests will find if you re-enable testRandom and
run it enough times: https://issues.apache.org/jira/browse/LUCENE-5595 . So I'm
not a fan of e.g. returning unexpected decomposed output from this thing right
now. Which transforms does this optimization help, and by how much?


