[GitHub] [lucene] magibney commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

GitBox Mon, 15 Mar 2021 14:06:19 -0700


magibney commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-799752136



   Yeah; it's been a while since I thought about this, but I think there's an 
inherent tension here: Transliterators are composable (they can be chained 
together), but each one is also atomic, a black box that assumes nothing about 
its input or output and internally does whatever manipulation it needs to get 
the text into and out of a form it's designed to work with. iirc the "black 
box" Transliterators that internally construct composite Transliterators don't 
cleanly expose their component parts, so if one wants the functionality 
provided by a given Transliterator, the only practical option is to accept 
whatever higher-level baggage is bundled with it.
   
   More accurate offsets, and probably better performance, could I think be 
achieved by "decomposing" composite transliterators (with some fancy footwork). 
But I remember being convinced that it would be really hard to do this 
"right" (see [last comment on parent 
PR](https://github.com/apache/lucene-solr/pull/892#issuecomment-567644194)), 
and that although it probably _could_ be done in Lucene, it would be more 
appropriately done in ICU. I think this would amount to a substantial change 
(addition?) to the ICU Transliterator API/implementation, which currently 
tracks neither offsets nor the way filters can "bypass" elements of the 
decomposed transliterator chain.
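
   To make the "decomposing" idea concrete, here is a minimal toy sketch in 
plain Java. It does not use ICU at all: `Step`, `Mapped`, `runDecomposed` and 
the two toy steps are hypothetical names invented for illustration, and the 
ICU Transliterator API offers nothing like them today. The assumption is that 
each step in a chain could report, for every output char, the input char it 
came from; composing those per-step maps would then give offsets back into the 
original text.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetSketch {

    /** Hypothetical result: output text plus, per output char, the index of its source input char. */
    static final class Mapped {
        final String text;
        final int[] origin;
        Mapped(String text, int[] origin) { this.text = text; this.origin = origin; }
    }

    /** Hypothetical offset-aware step; not part of ICU. */
    interface Step {
        Mapped apply(String input);
    }

    /** Toy step: deletes hyphens, leaving everything else unchanged. */
    static final Step DROP_HYPHENS = input -> {
        StringBuilder out = new StringBuilder();
        int[] origin = new int[input.length()];
        int n = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c != '-') { out.append(c); origin[n++] = i; }
        }
        return new Mapped(out.toString(), Arrays.copyOf(origin, n));
    };

    /** Toy step: expands U+00DF (sharp s) to "ss"; both output chars map to the one input char. */
    static final Step EXPAND_SHARP_S = input -> {
        StringBuilder out = new StringBuilder();
        int[] origin = new int[input.length() * 2];
        int n = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\u00DF') { out.append("ss"); origin[n++] = i; origin[n++] = i; }
            else { out.append(c); origin[n++] = i; }
        }
        return new Mapped(out.toString(), Arrays.copyOf(origin, n));
    };

    /** Runs the steps in order, composing each step's origin map back to the original input. */
    static Mapped runDecomposed(String input, List<Step> steps) {
        int[] origin = new int[input.length()];
        for (int i = 0; i < origin.length; i++) origin[i] = i; // identity map to start
        String text = input;
        for (Step step : steps) {
            Mapped m = step.apply(text);
            int[] next = new int[m.origin.length];
            for (int j = 0; j < next.length; j++) next[j] = origin[m.origin[j]];
            text = m.text;
            origin = next;
        }
        return new Mapped(text, origin);
    }

    public static void main(String[] args) {
        Mapped m = runDecomposed("Stra-\u00DFe", Arrays.asList(DROP_HYPHENS, EXPAND_SHARP_S));
        // prints: Strasse [0, 1, 2, 3, 5, 5, 6]
        System.out.println(m.text + " " + Arrays.toString(m.origin));
    }
}
```

   A black-box composite would only hand back "Strasse", so a char filter could 
do no better than attribute the whole changed span to the whole input span. 
The sketch also ignores the hard parts mentioned above, such as filters that 
let some text bypass parts of the chain.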
   

