magibney commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-799579642


   Thanks for looking at this, @rmuir! And no worries wrt "questions already 
answered" -- it's been long enough that this all feels fresh to me again, for 
better or worse :-)
   
   The Traditional->Simplified use case is a good example of where this type of 
functionality (however accomplished) is really necessary, because of the 
significance of dictionary-based tokenization for these scripts. It sounds 
promising to map the transliteration file onto a file for MappingCharFilter. 
Based on what you say, it looks like that would be viable for 
Traditional->Simplified. I think that approach would also address some of the 
weird issues this PR had to work around wrt offset resolution in composite 
transliterators. I wonder whether the same approach could apply to other 
scripts that commonly use dictionary-based tokenization?
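To make the offset point concrete, here is a hypothetical, self-contained sketch (not Lucene's actual MappingCharFilter, and with made-up mapping pairs) of why a strictly one-to-one codepoint mapping table is so attractive for Traditional->Simplified: output length equals input length, so offsets line up trivially and no offset correction is needed.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of a one-to-one codepoint mapping: each input
// char maps to exactly one output char, so position i in the output
// always corresponds to position i in the input -- offsets are
// preserved with no bookkeeping at all.
public class CodepointMapSketch {
    private static final Map<Character, Character> T2S = new HashMap<>();
    static {
        // A few illustrative Traditional -> Simplified pairs
        T2S.put('國', '国');
        T2S.put('語', '语');
        T2S.put('學', '学');
    }

    public static String map(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            out.append(T2S.getOrDefault(c, c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "國語學";
        String out = map(in);
        // Same length in and out, so offsets need no correction.
        System.out.println(in + " -> " + out);
    }
}
```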
   
   Perhaps this is what you were suggesting, but if the "MappingCharFilter" 
approach were used as an optimization, while still keeping a generic 
ICUTransformCharFilter backed directly by an ICU Transliterator, it would be 
possible to test the two approaches against each other for consistency, and the 
"vanilla" ICUTransformCharFilter would remain available for unanticipated use 
cases, etc. ...
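A randomized cross-check of that kind could look roughly like the following (a hypothetical harness, with both transforms abstracted as plain `Function<String, String>` rather than the actual char filters):

```java
import java.util.Random;
import java.util.function.Function;

// Hypothetical harness for cross-checking an optimized mapping-based
// transform against a reference implementation on random input drawn
// from a fixed alphabet. A seed makes failures reproducible.
public class ConsistencyCheck {
    public static boolean agreeOnRandomInput(
            Function<String, String> reference,
            Function<String, String> optimized,
            String alphabet, int iterations, long seed) {
        Random r = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            // Build a random string over the given alphabet.
            int len = r.nextInt(20);
            StringBuilder sb = new StringBuilder(len);
            for (int j = 0; j < len; j++) {
                sb.append(alphabet.charAt(r.nextInt(alphabet.length())));
            }
            String s = sb.toString();
            // Any disagreement between the two transforms is a failure.
            if (!reference.apply(s).equals(optimized.apply(s))) {
                return false;
            }
        }
        return true;
    }
}
```

In a real test the `reference` side would wrap the generic Transliterator-backed filter and `optimized` the mapping-based one, with the alphabet drawn from the transform's source character set.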
   
   True, `assumeExternalUnicodeNormalization` gets you into "shoot yourself in 
the foot" territory for sure, but it yields a roughly 4x performance improvement 
(based on some naive but, I think, representative benchmarks). [This
comment](https://github.com/apache/lucene-solr/pull/892#issuecomment-537538822) 
describes the approach, and the [comment immediately 
following](https://github.com/apache/lucene-solr/pull/892#issuecomment-537601926)
 confirms results. Regarding which transformations this applies to (from the 
first linked comment):
   >NFC, as a trailing transformation step, is both very common and very active 
-- active in the sense that it will in many common contexts block output 
waiting for combining diacritics for literally almost every character
   
   I'm inclined to think the significant performance gain for such a common 
case is worth it -- as a user I'd certainly not want that type of functionality 
hidden from me. I wonder if there's a way to "have cake and eat it too" ...
   

