[GitHub] [lucene-solr] msokolov commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

GitBox Tue, 10 Dec 2019 10:02:54 -0800

msokolov commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to 
support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#issuecomment-564156997
 
 
   OK, so you are good with the change I posted? If you don't see any issues,
   I will go ahead and commit after beasting again to make sure there are no
   more test failures due to randomization -- I think I was able to root them
   out.
   
   Regarding further automatic performance optimization, I think that is very
   ambitious and maybe too much magic. But let's see if it is needed in
   practice; if  a common "manual" pattern develops we can encode it as a
   convenience later.
   
   -Mike
   
   On Mon, Dec 9, 2019 at 7:30 AM Michael Gibney <notificati...@github.com>
   wrote:
   
   > Ah, I see. The refactoring looks good.
   >
   > Separately (and wrt the test-failures induced by randomization): the
   > withoutUnicodeNormalization randomization provides a good opportunity to
   > make some documentation more explicit (and/or build more nuance/complexity
   > into the handling of the assumeExternalUnicodeNormalization flag.
   >
   > The assumeExternalUnicodeNormalization arg cannot simply be flipped in
   > isolation without affecting behavior. My conception of this arg is as an
   > "expert" option that can significantly improve performance, but requires
   > users to know (and externally compensate for, with specific
   > ICUNormalizer2CharFilters in the top-level analysis chain) the specific
   > type of input and output unicode normalization applied by a *particular*
   > transliterator.
   >
   > It's true that this makes it possible for users to "shoot themselves in
   > the foot". One alternative would be to parse unicode normalization out of
   > the Transliterator and effectively "rewrite" the analysis chain, pulling
   > internal unicode normalization out to the top level as offset-aware
   > ICUNormalizer2CharFilters. Pulling at this thread though leads to other
   > potential considerations, such as:
   >
   >    1. I looked, and couldn't find any precedent for this type of
   >    under-the-hood analysis-chain rewriting.
   >    2. Should we prune redundant unicode normalization induced by such
   >    rewriting? Particularly if the redundancy is introduced by rewriting
   >    adjacent Transliterators?
   >    3. The more we attempt to handle for the user, the more we encourage
   >    users to think that this is a one-size-fits-all drop-in replacement. I
   >    actually *do* think it would be possible to design in this way, but
   >    I'm not sure I'd be prepared to guarantee it and my sense is that really
   >    doing it "right" -- flattening, respecting nested filters, propagating
   >    offsets -- could become quite complex.
   >
   > Might we consider proceeding in two phases? One with
   > assumeExternalUnicodeNormalization conceived as an expert option that
   > doesn't attempt anything more than disabling transliterator i/o
   > normailzation, and a potential future step that would attempt rewriting
   > composite (potentially hierarchical) Transliterators as top-level
   > CharFilters (with all the complexity that might entail)? For the first
   > phase, that could mean leaving tests as they were before the patch
   > introducing the randomized disabling of normalization, and adding
   > additional tests that explicitly disable normalization and compensate with
   > top-level normalization as appropriate for the associated transliterator.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > 
<https://github.com/apache/lucene-solr/pull/892?email_source=notifications&email_token=AAHHUQLY3SZOAPQ47MNLHRTQXZQB3A5CNFSM4IYSJJB2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGJSW6Y#issuecomment-563293051>,
   > or unsubscribe
   > 
<https://github.com/notifications/unsubscribe-auth/AAHHUQLH3VKCUQONTZJQONDQXZQB3ANCNFSM4IYSJJBQ>
   > .
   >


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-solr] msokolov commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

Reply via email to