magibney commented on pull request #15: URL: https://github.com/apache/lucene/pull/15#issuecomment-799752136
Yeah; it's been a while since I thought about this, but I think there's an inherent tension in how Transliterators work: they are composable (can be chained together), yet each one is also kind of atomic, a black box that assumes nothing about its input or output and internally does whatever manipulation it needs to get the text into and out of a form it's designed to work with. iirc the "black box" Transliterators that internally construct composite Transliterators don't have their component parts cleanly exposed, so if one wants the functionality provided by a given Transliterator, the only (practical) option is to accept whatever higher-level baggage is bundled with it.

More accurate offsets, and probably better performance, could I think be achieved by "decomposing" composite Transliterators (with some fancy footwork). But I remember feeling convinced that it would be really hard to do this "right" (see [last comment on parent PR](https://github.com/apache/lucene-solr/pull/892#issuecomment-567644194)), and that although it probably _could_ be done in Lucene, it would be more appropriately done in ICU. I think this would amount to a substantial change (addition?) to the ICU Transliterator API/implementation, which currently doesn't track offsets at all, nor does it track the way filters can kind of "bypass" elements of the decomposed transliterator chain.
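
To make the offset problem concrete, here is a minimal, self-contained Java sketch. It is not ICU or Lucene code: `OffsetSketch`, `Mapped`, `CharStep`, `opaque`, `applyTracked` and `runTracked` are hypothetical names invented for illustration, and the two steps (expanding `æ` to `ae`, then uppercasing) are toy stand-ins for real transliteration rules. It also assumes every step rewrites one character at a time, which real Transliterators do not guarantee, so treat it as an illustration of the idea rather than a design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class OffsetSketch {

    /** Output text plus, for each output char, the index of the input char that produced it. */
    static final class Mapped {
        final String text;
        final int[] origin;

        Mapped(String text, int[] origin) {
            this.text = text;
            this.origin = origin;
        }
    }

    /** Hypothetical offset-friendly step: rewrites one char into zero or more chars. */
    interface CharStep {
        String apply(char c);
    }

    // Toy steps standing in for real transliteration rules (assumptions, not ICU rules).
    static final CharStep EXPAND = c -> c == '\u00e6' ? "ae" : String.valueOf(c);
    static final CharStep UPPER = c -> String.valueOf(Character.toUpperCase(c));

    /** Lifts a per-char step to a whole-string function, discarding offset information. */
    static Function<String, String> opaque(CharStep step) {
        return s -> {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                out.append(step.apply(s.charAt(i)));
            }
            return out.toString();
        };
    }

    /** Black-box chain: composable, but all the caller gets back is a String. */
    static final Function<String, String> BLACK_BOX = opaque(EXPAND).andThen(opaque(UPPER));

    /** Starting point for tracking: every char maps to its own index. */
    static Mapped identity(String s) {
        int[] origin = new int[s.length()];
        for (int i = 0; i < origin.length; i++) {
            origin[i] = i;
        }
        return new Mapped(s, origin);
    }

    /** Applies one step while carrying each output char's original input index forward. */
    static Mapped applyTracked(Mapped in, CharStep step) {
        StringBuilder out = new StringBuilder();
        List<Integer> origin = new ArrayList<>();
        for (int i = 0; i < in.text.length(); i++) {
            String piece = step.apply(in.text.charAt(i));
            for (int j = 0; j < piece.length(); j++) {
                out.append(piece.charAt(j));
                origin.add(in.origin[i]);
            }
        }
        int[] arr = origin.stream().mapToInt(Integer::intValue).toArray();
        return new Mapped(out.toString(), arr);
    }

    /** Decomposed chain: the same steps, applied one at a time with offsets threaded through. */
    static Mapped runTracked(String s) {
        return applyTracked(applyTracked(identity(s), EXPAND), UPPER);
    }

    public static void main(String[] args) {
        String input = "\u00e6on";
        System.out.println(BLACK_BOX.apply(input)); // prints "AEON"
        Mapped m = runTracked(input);
        System.out.println(m.text + " " + Arrays.toString(m.origin)); // prints "AEON [0, 0, 1, 2]"
    }
}
```

Both chains produce `AEON`, but only the decomposed one can say that the output `A` and `E` both came from input index 0. Getting that mapping out of the black-box chain would mean re-deriving it after the fact, which is roughly the position an offset-correcting filter is in when the underlying API does not report it.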