Re: [racket-users] Are Regular Expression classes Unicode aware?

Peter W A Wood Sat, 11 Jul 2020 03:28:14 -0700

Dear Ryan

Thank you for both your clear, complete explanation and a working solution, 
which is more than sufficient for my needs.


I created a very simple function based on the regexp that you suggested and 
tested it against a number of cases:


#lang racket
(require test-engine/racket-tests)

;; check-expect defers its tests until (test) runs, so alpha? can be
;; defined after the test cases.
(check-expect (alpha? "") #f)                  ; empty string
(check-expect (alpha? "1") #f)
(check-expect (alpha? "a") #t)
(check-expect (alpha? "hello") #t)
(check-expect (alpha? "h1llo") #f)
(check-expect (alpha? "\u00E7c\u0327") #t)     ; çç (precomposed, then c + combining cedilla)
(check-expect (alpha? "noe\u0308l") #t)        ; noël (e + combining diaeresis)
(check-expect (alpha? "\U01D122") #f)          ; 𝄢 (bass clef)
(check-expect (alpha? "\u216B") #f)            ; Ⅻ (roman numeral)
(check-expect (alpha? "\u0BEB") #f)            ; ௫ (5 in Tamil)
(check-expect (alpha? "二の句") #t)             ; Japanese word "ninoku"
(check-expect (alpha? "مدينة") #t)             ; Arabic word "madina"
(check-expect (alpha? "٥") #f)                 ; Arabic number 5
(check-expect (alpha? "\u0628\uFCF2") #t)      ; Arabic letter beh with shaddah

;; #t if s, after NFC normalization, is one or more Unicode letters.
(define (alpha? s)
  (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))

(test)

I suspect that there are some cases this does not handle, in scripts requiring 
multiple code points to render a single character, such as Arabic with 
pronunciation marks, e.g. دُنْيَا. At the moment, I don’t have the time (or 
need) to investigate further.
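
For anyone who does need such cases, here is a minimal, untested-in-anger 
sketch of the other fix Ryan mentions below (expanding the regexp to include 
modifiers). The name alpha/marks? is hypothetical, and treating "a letter 
followed by any number of combining marks" as alphabetic is an assumption 
that may not suit every application.

```racket
#lang racket

;; Hypothetical variant of alpha?: every letter (general category L) may be
;; followed by zero or more combining marks (general category M).
;; Normalizing to NFC first is kept from the original function.
(define (alpha/marks? s)
  (regexp-match? #px"^(?:\\p{L}\\p{M}*)+$" (string-normalize-nfc s)))
```

With this pattern, the marked Arabic word above (written with escapes as 
"\u062F\u064F\u0646\u0652\u064A\u064E\u0627") should be accepted, while 
"h1llo" is still rejected. A string that starts with a combining mark is 
rejected, because each mark must follow a letter.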

The depth of Racket’s Unicode support is impressive.

Once again, thanks.

Peter


> On 10 Jul 2020, at 15:47, Ryan Culpepper <[email protected]> wrote:
> 
> (I see this went off the mailing list. If you reply, please consider CCing 
> the list.)
> 
> Yes, I understood your goal of trying to capture the notion of Unicode 
> "alphabetic" characters with a regular expression.
> 
> As far as I know, Unicode doesn't have a notion of "alphabetic", but it does 
> assign every code point to a "General category", consisting of a main 
> category and a subcategory. There is a category called "Letter", which seems 
> like one reasonable generalization of "alphabetic".
> 
> In Racket, you can get the code for a character's category using 
> `char-general-category`. For example:
> 
>   > (char-general-category #\A)
>   'lu
>   > (char-general-category #\é)
>   'll
>   > (char-general-category #\ß)
>   'll
>   > (char-general-category #\7)
>   'nd
> 
> The general category for "A" is "Letter, uppercase", which has the code "Lu", 
> which Racket turns into the symbol 'lu. The general category of "é" is 
> "Letter, lowercase", code "Ll", which becomes 'll. The general category of 
> "7" is "Number, decimal digit", code "Nd".
> 
> In Racket regular expressions, the \p{category} syntax lets you recognize 
> characters from a specific category. For example, \p{Lu} recognizes 
> characters with the category "Letter, uppercase", and \p{L} recognizes 
> characters with the category "Letter", which is the union of "Letter, 
> uppercase", "Letter, lowercase", and so on.
> 
> So the regular expression #px"^\\p{L}+$" recognizes sequences of one or more 
> Unicode letters. For example:
> 
>   > (regexp-match? #px"^\\p{L}+$" "héllo")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "straße")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "二の句")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "abc123")
>   #f ;; No, contains numbers
> 
> There are still some problems to watch out for, though. For example, accented 
> characters like "é" can be expressed as a single pre-composed code point or 
> "decomposed" into a base letter and a combining mark. You can get the 
> decomposed form by converting the string to "decomposed normal form" (NFD), 
> and the regexp above won't match that string.
> 
>   > (map char-general-category (string->list (string-normalize-nfd "é")))
>   '(ll mn)
>   > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
>   #f
> 
> One fix would be to call `string-normalize-nfc` first, but some 
> letter-modifier pairs don't have pre-composed versions. Another fix would be 
> to expand the regexp to include modifiers. You'd have to decide which is 
> better based on your application.
> 
> Ryan
> 
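
The link between \p{L} and `char-general-category` described above can be 
sketched without a regexp. This is only an illustration: the names 
letter-categories, letter-char? and alpha-by-category? are hypothetical, and 
the sketch assumes that the five symbols listed are exactly the "Letter" 
subcategories.

```racket
#lang racket

;; Assumed: the symbols char-general-category returns for the five
;; "Letter" subcategories (Lu, Ll, Lt, Lm, Lo).
(define letter-categories '(lu ll lt lm lo))

;; #t if the character's general category is one of the Letter subcategories.
(define (letter-char? c)
  (and (memq (char-general-category c) letter-categories) #t))

;; #t if s is non-empty and, after NFC normalization, contains only letters.
(define (alpha-by-category? s)
  (and (positive? (string-length s))
       (for/and ([c (in-string (string-normalize-nfc s))])
         (letter-char? c))))
```

This should agree with the regexp version on the examples in this thread, 
for instance accepting "straße" and rejecting "abc123". Working on characters 
directly makes it easy to widen or narrow the accepted categories later.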
