markharwood opened a new pull request #1489:
URL: https://github.com/apache/lucene-solr/pull/1489


   Jira Issue [9336](https://issues.apache.org/jira/browse/LUCENE-9336) 
proposes adding support for common regex character classes like `\w`.
   This PR adds the code to RegExp.java and associated tests.
   
   The implementation could have gone one of two ways:
   1) Extend `Kind` with new types for DIGIT/WHITESPACE etc. and, for each 
type, add a corresponding `make[Type]` method plus case statements in 
`toString`, `toStringTree` and `toAutomaton`, or
   2) Reuse existing Kinds such as range by adding a simple piece of logic to 
the parser that expands `\d` into its [documented 
equivalent](https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html#CHART),
 i.e. `[0-9]`.
   
   I went for option 2, which makes the code shorter and cleaner and makes the 
meaning of expressions like `\d` easier to read in the code. The downside is 
that the `toString` representations of these inputs are less succinct: they 
render the fully expanded character lists rather than the shorthand `\x`-style 
inputs that generated them.
   Happy to change this if we feel it is the wrong trade-off.
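
   To illustrate option 2 and its trade-off, here is a minimal sketch. It is 
not the code in this PR: `ShorthandExpander` is a hypothetical class that 
rewrites the pattern string and emits `java.util.regex` syntax, whereas the 
PR expands shorthands inside the `RegExp` parser. The expansions come from 
the Oracle chart linked above.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene or this PR.
final class ShorthandExpander {

    // Expansions as documented in the Oracle predefined character class chart.
    private static final Map<Character, String> EXPANSIONS = Map.of(
        'd', "[0-9]",
        'D', "[^0-9]",
        's', "[ \\t\\n\\x0B\\f\\r]",
        'S', "[^ \\t\\n\\x0B\\f\\r]",
        'w', "[a-zA-Z_0-9]",
        'W', "[^a-zA-Z_0-9]");

    // Replaces each known shorthand (e.g. \d) with its expanded character class.
    // Any other escape sequence (e.g. \.) is copied through unchanged.
    static String expand(String pattern) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                char next = pattern.charAt(i + 1);
                String expansion = EXPANSIONS.get(next);
                if (expansion != null) {
                    out.append(expansion);
                } else {
                    out.append(c).append(next);
                }
                i++; // skip the character that followed the backslash
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The original shorthand is not recoverable from the expanded form.
        System.out.println(expand("\\d+\\s\\w*"));
    }
}
```

   Once expanded, `\d+` is indistinguishable from a hand-written `[0-9]+`, 
which is why `toString` can only render the longer form.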
   
   One other consideration is that the list of shorthand expressions could 
perhaps be made configurable. For example, `\h` might be registered as 
shorthand for hashtags of the form `#\w*` if that was something users 
routinely searched for and wanted to add to the regex vocabulary.
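
   A configurable list might look something like the sketch below. 
`ShorthandRegistry` and its `register` method are hypothetical names chosen 
for illustration; nothing like this exists in the PR. The sketch works on 
pattern strings and ships only two built-in shorthands to keep it short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry for illustration only; not part of Lucene or this PR.
final class ShorthandRegistry {

    private final Map<Character, String> expansions = new HashMap<>();

    ShorthandRegistry() {
        // Two built-in shorthands; a fuller version would list them all.
        expansions.put('d', "[0-9]");
        expansions.put('w', "[a-zA-Z_0-9]");
    }

    // Registers a user-defined shorthand. The definition may itself use
    // shorthands that are already registered; they are expanded up front.
    void register(char shorthand, String definition) {
        expansions.put(shorthand, expand(definition));
    }

    // Replaces each registered shorthand with its expansion and copies
    // every other escape sequence through unchanged.
    String expand(String pattern) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                char next = pattern.charAt(i + 1);
                String expansion = expansions.get(next);
                if (expansion != null) {
                    out.append(expansion);
                } else {
                    out.append(c).append(next);
                }
                i++; // skip the character that followed the backslash
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ShorthandRegistry registry = new ShorthandRegistry();
        registry.register('h', "#\\w*"); // the hashtag example from above
        System.out.println(registry.expand("\\h \\d"));
    }
}
```

   Expanding the definition at registration time keeps `expand` a single pass 
and avoids unbounded recursion if a definition refers to itself.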
   
   
   
   




