Dear Ryan,

Thank you very much for the kind, detailed explanation, which I will study carefully. It was not my intention to reply to you off-list. I hope I have correctly addressed this reply to appear on-list.
Peter

> On 10 Jul 2020, at 15:47, Ryan Culpepper <[email protected]> wrote:
>
> (I see this went off the mailing list. If you reply, please consider CCing
> the list.)
>
> Yes, I understood your goal of trying to capture the notion of Unicode
> "alphabetic" characters with a regular expression.
>
> As far as I know, Unicode doesn't have a notion of "alphabetic", but it does
> assign every code point to a "General category", consisting of a main
> category and a subcategory. There is a category called "Letter", which seems
> like one reasonable generalization of "alphabetic".
>
> In Racket, you can get the code for a character's category using
> `char-general-category`. For example:
>
>   > (char-general-category #\A)
>   'lu
>   > (char-general-category #\é)
>   'll
>   > (char-general-category #\ß)
>   'll
>   > (char-general-category #\7)
>   'nd
>
> The general category for "A" is "Letter, uppercase", which has the code "Lu",
> which Racket turns into the symbol 'lu. The general category of "é" is
> "Letter, lowercase", code "Ll", which becomes 'll. The general category of
> "7" is "Number, decimal digit", code "Nd".
>
> In Racket regular expressions, the \p{category} syntax lets you recognize
> characters from a specific category. For example, \p{Lu} recognizes
> characters with the category "Letter, uppercase", and \p{L} recognizes
> characters with the category "Letter", which is the union of "Letter,
> uppercase", "Letter, lowercase", and so on.
>
> So the regular expression #px"^\\p{L}+$" recognizes sequences of one or more
> Unicode letters. For example:
>
>   > (regexp-match? #px"^\\p{L}+$" "héllo")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "straße")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "二の句")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "abc123")
>   #f ;; No, contains numbers
>
> There are still some problems to watch out for, though.
> For example, accented
> characters like "é" can be expressed as a single pre-composed code point or
> "decomposed" into a base letter and a combining mark. You can get the
> decomposed form by converting the string to "decomposed normal form" (NFD),
> and the regexp above won't match that string.
>
>   > (map char-general-category (string->list (string-normalize-nfd "é")))
>   '(ll mn)
>   > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
>   #f
>
> One fix would be to call `string-normalize-nfc` first, but some
> letter-modifier pairs don't have pre-composed versions. Another fix would be
> to expand the regexp to include modifiers. You'd have to decide which is
> better based on your application.
>
> Ryan
>
>
>
> On Fri, Jul 10, 2020 at 2:10 AM Peter W A Wood <[email protected]> wrote:
> Ryan
>
> > On 9 Jul 2020, at 22:52, Ryan Culpepper <[email protected]> wrote:
> >
> > If you want a regular expression that does match the example string, you
> > can use the \p{property} notation. For example:
> >
> >   > (regexp-match? #px"^\\p{L}+$" "h\uFFC3\uFFA9llo")
> >   #t
> >
> > The "Regexp Syntax" docs have a grammar for regular expressions with links
> > to examples.
> >
> > Ryan
>
> Thanks. I used héllo as an example. I was wondering if there was a way of
> specifying a regular expression group for Unicode “alphabetic” characters.
>
> On reflection, it seems a somewhat esoteric requirement that is almost
> impossible to satisfy. By way of example, would
> “Straße” be considered alphabetic? Would “二の句” be considered alphabetic?
>
> Strangely, Python considers the Japanese characters as being alphabetic but
> will not accept “Straße” as a valid string. (I suspect this is due to some
> problem relating to Locale.)
>
>   >>> "二の句".isalpha()
>   True
>   >>> “Straße".isalpha()
>     File "<stdin>", line 1
>       “Straße".isalpha()
>       ^
>   SyntaxError: invalid character in identifier
>
> Clearly, trying to identify “Unicode” alphabetic characters is far from
> straightforward, though it may well be useful in processing some language
> texts.
>
> Peter
>
>

--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/racket-users/BC855B5D-80BF-458B-A2D2-9570B0436646%40gmail.com.
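
A note on the Python session quoted in this thread: the SyntaxError most likely comes from the curly opening quote (“) in the typed expression rather than from the locale, since Python only accepts straight quotes as string delimiters. The sketch below, assuming Python 3 and only the standard library, repeats the check with straight quotes and mirrors the general-category and NFD examples from Ryan's explanation using the `unicodedata` module. The names `E_ACUTE`, `categories` and `letters_with_marks` are made up for illustration and do not appear in either message. Python's `re` module has no \p{L} syntax, so the last helper tests categories directly instead of using a regular expression.

```python
import unicodedata

E_ACUTE = "\u00e9"  # precomposed "é", written as an escape to avoid ambiguity

# With straight ASCII quotes, both strings from the thread parse and pass isalpha().
print("二の句".isalpha())   # True
print("Straße".isalpha())   # True


def categories(s):
    """Return the Unicode general category code of every code point in s."""
    return [unicodedata.category(ch) for ch in s]


# Same categories as the char-general-category examples in the thread.
print(categories("A" + E_ACUTE + "ß7"))   # ['Lu', 'Ll', 'Ll', 'Nd']

# NFD splits the precomposed letter into a base letter plus a combining
# mark (category Mn), so a letters-only test rejects the decomposed string.
nfd = unicodedata.normalize("NFD", "h" + E_ACUTE + "llo")
print(categories(unicodedata.normalize("NFD", E_ACUTE)))   # ['Ll', 'Mn']
print(nfd.isalpha())                                       # False

# First fix mentioned in the thread: recompose with NFC before testing.
print(unicodedata.normalize("NFC", nfd).isalpha())         # True


def letters_with_marks(s):
    """Hypothetical helper for the second fix: accept a string that starts
    with a letter (L*) and otherwise contains only letters or marks (M*)."""
    if not s or not unicodedata.category(s[0]).startswith("L"):
        return False
    return all(unicodedata.category(ch)[0] in "LM" for ch in s)


print(letters_with_marks(nfd))        # True
print(letters_with_marks("abc123"))   # False
```

As in the Racket discussion, which fix is preferable depends on the application: NFC cannot recompose every letter-mark pair, while accepting marks also admits combining characters that a stricter check might want to reject.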

