Re: [racket-users] Are Regular Expression classes Unicode aware?

Peter W A Wood Fri, 10 Jul 2020 03:29:32 -0700

Dear Ryan

Thank you very much for the kind, detailed explanation which I will study 
carefully. It was not my intention to reply to you off-list. I hope I have 
correctly addressed this reply to appear on-list.


Peter

> On 10 Jul 2020, at 15:47, Ryan Culpepper <[email protected]> wrote:
> 
> (I see this went off the mailing list. If you reply, please consider CCing 
> the list.)
> 
> Yes, I understood your goal of trying to capture the notion of Unicode 
> "alphabetic" characters with a regular expression.
> 
> As far as I know, Unicode doesn't have a notion of "alphabetic", but it does 
> assign every code point to a "General category", consisting of a main 
> category and a subcategory. There is a category called "Letter", which seems 
> like one reasonable generalization of "alphabetic".
> 
> In Racket, you can get the code for a character's category using 
> `char-general-category`. For example:
> 
>   > (char-general-category #\A)
>   'lu
>   > (char-general-category #\é)
>   'll
>   > (char-general-category #\ß)
>   'll
>   > (char-general-category #\7)
>   'nd
> 
> The general category for "A" is "Letter, uppercase", which has the code "Lu", 
> which Racket turns into the symbol 'lu. The general category of "é" is 
> "Letter, lowercase", code "Ll", which becomes 'll. The general category of 
> "7" is "Number, decimal digit", code "Nd".
> 
> In Racket regular expressions, the \p{category} syntax lets you recognize 
> characters from a specific category. For example, \p{Lu} recognizes 
> characters with the category "Letter, uppercase", and \p{L} recognizes 
> characters with the category "Letter", which is the union of "Letter, 
> uppercase", "Letter, lowercase", and so on.
> 
> So the regular expression #px"^\\p{L}+$" recognizes sequences of one or more 
> Unicode letters. For example:
> 
>   > (regexp-match? #px"^\\p{L}+$" "héllo")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "straße")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "二の句")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "abc123")
>   #f ;; No, contains numbers
> 
> There are still some problems to watch out for, though. For example, accented 
> characters like "é" can be expressed as a single pre-composed code point or 
> "decomposed" into a base letter and a combining mark. You can get the 
> decomposed form by converting the string to Normalization Form D (NFD). The 
> regexp above won't match that string, because the combining mark has the 
> category "Mark, nonspacing" (code "Mn"), which is not a letter category.
> 
>   > (map char-general-category (string->list (string-normalize-nfd "é")))
>   '(ll mn)
>   > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
>   #f
> 
> One fix would be to call `string-normalize-nfc` first, but some 
> letter-mark pairs don't have pre-composed versions. Another fix would be 
> to expand the regexp to accept combining marks as well. You'd have to decide 
> which is better based on your application.
> 
> Ryan
> 
> 
> 
> On Fri, Jul 10, 2020 at 2:10 AM Peter W A Wood <[email protected]> wrote:
> Ryan
> 
> > On 9 Jul 2020, at 22:52, Ryan Culpepper <[email protected]> wrote:
> > 
> > If you want a regular expression that does match the example string, you 
> > can use the \p{property} notation. For example:
> > 
> >   > (regexp-match? #px"^\\p{L}+$" "h\uFFC3\uFFA9llo")
> >   #t
> > 
> > The "Regexp Syntax" docs have a grammar for regular expressions with links 
> > to examples.
> > 
> > Ryan
> 
> Thanks. I used héllo as an example. I was wondering if there was a way of 
> specifying a regular expression group for Unicode “alphabetic” characters. 
> 
> On reflection, it seems a somewhat esoteric requirement that is almost 
> impossible to satisfy. By way of example, would “Straße” be considered 
> alphabetic? Would “二の句” be considered alphabetic?
> 
> Strangely, Python considers the Japanese characters alphabetic but 
> will not accept “Straße” as a valid string. (I suspect this is due to some 
> problem relating to locale.)
> 
>   >>> "二の句".isalpha()
>   True
>   >>> “Straße".isalpha()
>     File "<stdin>", line 1
>       “Straße".isalpha()
>             ^
>   SyntaxError: invalid character in identifier
> 
> Clearly, trying to identify “Unicode” alphabetic characters is far from 
> straightforward, though it may well be useful in processing some language 
> texts.
> 
> Peter
> 
> 
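
The two fixes Ryan mentions for decomposed strings can be sketched as follows. This is only an illustration under stated assumptions, not code from the thread: the names `letters-only`, `letters-with-marks`, and `decomposed` are made up for the example, and the second pattern is one possible way to "expand the regexp", using \p{M} (the "Mark" category) to allow combining marks after each letter.

```racket
#lang racket/base

;; Pattern from the thread: one or more code points in category "Letter".
(define letters-only #px"^\\p{L}+$")

;; Hypothetical expanded pattern (an assumption, not from the thread):
;; each letter may be followed by any number of combining marks
;; (category "Mark": Mn, Mc, Me).
(define letters-with-marks #px"^(?:\\p{L}\\p{M}*)+$")

;; "héllo" in decomposed form: the "é" becomes "e" plus a combining accent.
(define decomposed (string-normalize-nfd "héllo"))

;; The original pattern rejects the decomposed string.
(regexp-match? letters-only decomposed)                         ; #f

;; Fix 1: normalize to NFC first, then use the original pattern.
(regexp-match? letters-only (string-normalize-nfc decomposed))  ; #t

;; Fix 2: keep the string as it is and use the expanded pattern.
(regexp-match? letters-with-marks decomposed)                   ; #t

;; The expanded pattern still rejects digits.
(regexp-match? letters-with-marks "abc123")                     ; #f
```

Normalizing first keeps the pattern simple but only helps where a pre-composed code point exists. The expanded pattern accepts any letter-plus-marks sequence, and it requires every mark to follow a letter.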

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/BC855B5D-80BF-458B-A2D2-9570B0436646%40gmail.com.
