Great, I'm glad it was useful!

Ryan

On Sat, Jul 11, 2020 at 12:27 PM Peter W A Wood <[email protected]> wrote:

> Dear Ryan
>
> Thank you for both your full, complete and understandable explanation and
> a working solution which is more than sufficient for my needs.
>
> I created a very simple function based on the regexp that you suggested
> and tested it against a number of cases:
>
> #lang racket
> (require test-engine/racket-tests)
>
> (check-expect (alpha? "") #f)                 ; empty string
> (check-expect (alpha? "1") #f)
> (check-expect (alpha? "a") #t)
> (check-expect (alpha? "hello") #t)
> (check-expect (alpha? "h1llo") #f)
> (check-expect (alpha? "\u00E7c\u0327") #t)    ; çç
> (check-expect (alpha? "noe\u0308l") #t)       ; noël
> (check-expect (alpha? "\U01D122") #f)         ; 𝄢 (bass clef)
> (check-expect (alpha? "\u216B") #f)           ; Ⅻ (roman numeral)
> (check-expect (alpha? "\u0BEB") #f)           ; ௫ (5 in Tamil)
> (check-expect (alpha? "二の句") #t)           ; Japanese word "ninoku"
> (check-expect (alpha? "مدينة") #t)            ; Arabic word "madina"
> (check-expect (alpha? "٥") #f)                ; Arabic number 5
> (check-expect (alpha? "\u0628\uFCF2") #t)     ; Arabic letter beh with shaddah
> (define (alpha? s)
>   (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))
> (test)
>
> I suspect that there are some cases with scripts requiring multiple code
> points to render a single character, such as Arabic with pronunciation
> marks, e.g. دُ نْيَا. At the moment, I don't have the time (or need) to
> investigate further.
>
> The depth of Racket's Unicode support is impressive.
>
> Once again, thanks.
>
> Peter
>
> On 10 Jul 2020, at 15:47, Ryan Culpepper <[email protected]> wrote:
> >
> > (I see this went off the mailing list. If you reply, please consider
> > CCing the list.)
> >
> > Yes, I understood your goal of trying to capture the notion of Unicode
> > "alphabetic" characters with a regular expression.
> >
> > As far as I know, Unicode doesn't have a notion of "alphabetic", but it
> > does assign every code point to a "General category", consisting of a main
> > category and a subcategory. There is a category called "Letter", which
> > seems like one reasonable generalization of "alphabetic".
> >
> > In Racket, you can get the code for a character's category using
> > `char-general-category`. For example:
> >
> > > (char-general-category #\A)
> > 'lu
> > > (char-general-category #\é)
> > 'll
> > > (char-general-category #\ß)
> > 'll
> > > (char-general-category #\7)
> > 'nd
> >
> > The general category for "A" is "Letter, uppercase", which has the code
> > "Lu", which Racket turns into the symbol 'lu. The general category of "é"
> > is "Letter, lowercase", code "Ll", which becomes 'll. The general category
> > of "7" is "Number, decimal digit", code "Nd".
> >
> > In Racket regular expressions, the \p{category} syntax lets you
> > recognize characters from a specific category. For example, \p{Lu}
> > recognizes characters with the category "Letter, uppercase", and \p{L}
> > recognizes characters with the category "Letter", which is the union of
> > "Letter, uppercase", "Letter, lowercase", and so on.
> >
> > So the regular expression #px"^\\p{L}+$" recognizes sequences of one or
> > more Unicode letters. For example:
> >
> > > (regexp-match? #px"^\\p{L}+$" "héllo")
> > #t
> > > (regexp-match? #px"^\\p{L}+$" "straße")
> > #t
> > > (regexp-match? #px"^\\p{L}+$" "二の句")
> > #t
> > > (regexp-match? #px"^\\p{L}+$" "abc123")
> > #f ;; No, contains numbers
> >
> > There are still some problems to watch out for, though. For example,
> > accented characters like "é" can be expressed as a single pre-composed code
> > point or "decomposed" into a base letter and a combining mark. You can get
> > the decomposed form by converting the string to "decomposed normal form"
> > (NFD), and the regexp above won't match that string.
> >
> > > (map char-general-category (string->list (string-normalize-nfd "é")))
> > '(ll mn)
> > > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
> > #f
> >
> > One fix would be to call `string-normalize-nfc` first, but some
> > letter-modifier pairs don't have pre-composed versions. Another fix would
> > be to expand the regexp to include modifiers. You'd have to decide which is
> > better based on your application.
> >
> > Ryan
>
> --
> You received this message because you are subscribed to the Google Groups
> "Racket Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/racket-users/09B244A4-89C5-4B5C-97E7-5487059125F6%40gmail.com.
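
For anyone finding this thread later: the second fix mentioned above (expanding
the regexp to include modifiers) might look roughly like the sketch below. The
name `alpha/marks?` is hypothetical and not from the thread, and the choice to
require a leading letter followed by any mix of letters and marks (general
category "M") is an assumption about what "include modifiers" should mean, not
a definitive answer.

```racket
#lang racket

;; Hypothetical helper, not from the thread.
;; Accepts a string that starts with a letter (category L) and continues
;; with any mix of letters and combining marks (category M = Mn, Mc, Me).
;; Alternation is used instead of a character class, on the assumption
;; that \p{...} is only valid as a standalone regexp atom.
(define (alpha/marks? s)
  (regexp-match? #px"^\\p{L}(?:\\p{L}|\\p{M})*$" s))

;; Decomposed "héllo": the combining acute accent has category Mn.
(alpha/marks? (string-normalize-nfd "héllo"))

;; Arabic letters with vowel marks (dal, damma, noon, sukun, yeh, fatha, alef).
(alpha/marks? "\u062F\u064F\u0646\u0652\u064A\u064E\u0627")

;; Digits are still rejected, as is the empty string.
(alpha/marks? "abc123")
(alpha/marks? "")
```

Unlike normalizing to NFC first, this does not depend on a pre-composed code
point existing for each letter-mark pair, but it also accepts marks that may
not make sense after a given letter. Which trade-off is better depends on the
application.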

