Dear Ryan
Thank you for both your thorough, clear explanation and a working solution,
which is more than sufficient for my needs.
I created a very simple function based on the regexp that you suggested and
tested it against a number of cases:
#lang racket
(require test-engine/racket-tests)
(check-expect (alpha? "") #f) ; empty string
(check-expect (alpha? "1") #f)
(check-expect (alpha? "a") #t)
(check-expect (alpha? "hello") #t)
(check-expect (alpha? "h1llo") #f)
(check-expect (alpha? "\u00E7c\u0327") #t) ; çç
(check-expect (alpha? "noe\u0308l") #t) ; noël (e + combining diaeresis)
(check-expect (alpha? "\U01D122") #f) ; 𝄢 (bass clef)
(check-expect (alpha? "\u216B") #f) ; Ⅻ (roman numeral)
(check-expect (alpha? "\u0BEB") #f) ; ௫ (5 in Tamil)
(check-expect (alpha? "二の句") #t) ; Japanese word "ninoku"
(check-expect (alpha? "مدينة") #t) ; Arabic word "madina"
(check-expect (alpha? "٥") #f) ; Arabic number 5
(check-expect (alpha? "\u0628\uFCF2") #t) ; Arabic letter beh with shaddah
(define (alpha? s)
  (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))
(test)
I suspect that some cases will still fail: scripts that require multiple code
points to render a single character, such as Arabic with pronunciation marks,
e.g. دُنْيَا. At the moment, I don’t have the time (or need) to investigate further.
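For anyone who does need that case, here is a minimal sketch of the other fix Ryan mentions (expanding the regexp to include modifiers). The name alpha/marks? is made up for illustration, and I am assuming that "a letter followed by any number of combining marks" is an acceptable definition; it has not been tested beyond the few cases shown.

```racket
#lang racket

;; Hypothetical variant of alpha? (the name alpha/marks? is an assumption):
;; every letter (\p{L}) may be followed by any number of combining
;; marks (\p{M}), so decomposed text and marked Arabic are accepted.
(define (alpha/marks? s)
  (regexp-match? #px"^(?:\\p{L}\\p{M}*)+$" s))

;; The NFC-based definition from this message, repeated for comparison.
(define (alpha? s)
  (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))

;; "dunya" with marks: dal, damma, noon, sukun, yeh, fatha, alef
(define dunya "\u062F\u064F\u0646\u0652\u064A\u064E\u0627")

(alpha? dunya)                                      ; #f: the marks are Mn, not L
(alpha/marks? dunya)                                ; #t
(alpha/marks? (string-normalize-nfd "h\u00E9llo"))  ; #t: e + combining acute
(alpha/marks? "h1llo")                              ; #f
(alpha/marks? "")                                   ; #f
```

Because the pattern requires a letter before each run of marks, a string that starts with a bare combining mark is still rejected.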
The depth of Racket’s Unicode support is impressive.
Once again, thanks.
Peter
> On 10 Jul 2020, at 15:47, Ryan Culpepper <[email protected]> wrote:
>
> (I see this went off the mailing list. If you reply, please consider CCing
> the list.)
>
> Yes, I understood your goal of trying to capture the notion of Unicode
> "alphabetic" characters with a regular expression.
>
> As far as I know, Unicode doesn't have a notion of "alphabetic", but it does
> assign every code point to a "General category", consisting of a main
> category and a subcategory. There is a category called "Letter", which seems
> like one reasonable generalization of "alphabetic".
>
> In Racket, you can get the code for a character's category using
> `char-general-category`. For example:
>
> > (char-general-category #\A)
> 'lu
> > (char-general-category #\é)
> 'll
> > (char-general-category #\ß)
> 'll
> > (char-general-category #\7)
> 'nd
>
> The general category for "A" is "Letter, uppercase", which has the code "Lu",
> which Racket turns into the symbol 'lu. The general category of "é" is
> "Letter, lowercase", code "Ll", which becomes 'll. The general category of
> "7" is "Number, decimal digit", code "Nd".
>
> In Racket regular expressions, the \p{category} syntax lets you recognize
> characters from a specific category. For example, \p{Lu} recognizes
> characters with the category "Letter, uppercase", and \p{L} recognizes
> characters with the category "Letter", which is the union of "Letter,
> uppercase", "Letter, lowercase", and so on.
>
> So the regular expression #px"^\\p{L}+$" recognizes sequences of one or more
> Unicode letters. For example:
>
> > (regexp-match? #px"^\\p{L}+$" "héllo")
> #t
> > (regexp-match? #px"^\\p{L}+$" "straße")
> #t
> > (regexp-match? #px"^\\p{L}+$" "二の句")
> #t
> > (regexp-match? #px"^\\p{L}+$" "abc123")
> #f ;; No, contains numbers
>
> There are still some problems to watch out for, though. For example, accented
> characters like "é" can be expressed as a single pre-composed code point or
> "decomposed" into a base letter and a combining mark. You can get the
> decomposed form by converting the string to "decomposed normal form" (NFD),
> and the regexp above won't match that string.
>
> > (map char-general-category (string->list (string-normalize-nfd "é")))
> '(ll mn)
> > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
> #f
>
> One fix would be to call `string-normalize-nfc` first, but some
> letter-modifier pairs don't have pre-composed versions. Another fix would be
> to expand the regexp to include modifiers. You'd have to decide which is
> better based on your application.
>
> Ryan
>
--
To view this discussion on the web visit
https://groups.google.com/d/msgid/racket-users/09B244A4-89C5-4B5C-97E7-5487059125F6%40gmail.com.