Re: [racket-users] Are Regular Expression classes Unicode aware?

Ryan Culpepper Sat, 11 Jul 2020 04:42:14 -0700

Great, I'm glad it was useful!

Ryan



On Sat, Jul 11, 2020 at 12:27 PM Peter W A Wood <[email protected]>
wrote:

> Dear Ryan
>
> Thank you for both your full, complete and understandable explanation and
> a working solution which is more than sufficient for my needs.
>
> I created a very simple function based on the reg=exp that you suggested
> and tested it against a number of cases:
>
>
> #lang racket
> (require test-engine/racket-tests)
>
> (check-expect (alpha? "") #f)                                       ;
> empty string
> (check-expect (alpha? "1") #f)
> (check-expect (alpha? "a") #t)
> (check-expect (alpha? "hello") #t)
> (check-expect (alpha? "h1llo") #f)
> (check-expect (alpha? "\u00E7c\u0327") #t)               ; çç
> (check-expect (alpha? "noe\u0308l") #t)                     ; noél
> (check-expect (alpha? "\U01D122") #f)                       ; 𝄢 (bass
> clef)
> (check-expect (alpha? "\u216B") #f)                           ; Ⅻ (roman
> numeral)
> (check-expect (alpha? "\u0BEB") #f)                           ; ௫ (5 in
> Tamil)
> (check-expect (alpha? "二の句") #t)                            ; Japanese
> word "ninoku"
> (check-expect (alpha? "مدينة") #t)                                ; Arabic
> word "madina"
> (check-expect (alpha? "٥") #f)                                     ;
> Arabic number 5
> (check-expect (alpha? "\u0628\uFCF2") #t)                ; Arabic letter
> beh with shaddah
> (define (alpha? s)
>  (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))
> (test)
>
> I suspect that there are some cases with scripts requiring multiple code
> points to render a single character such as Arabic with pronunciation marks
> e.g. دُ نْيَا. At the moment, I don’t have the time (or need) to
> investigate further.
>
> The depth of Racket’s Unicode support is impressive.
>
> Once again, thanks.
>
> Peter
>
>
> > On 10 Jul 2020, at 15:47, Ryan Culpepper <[email protected]> wrote:
> >
> > (I see this went off the mailing list. If you reply, please consider
> CCing the list.)
> >
> > Yes, I understood your goal of trying to capture the notion of Unicode
> "alphabetic" characters with a regular expression.
> >
> > As far as I know, Unicode doesn't have a notion of "alphabetic", but it
> does assign every code point to a "General category", consisting of a main
> category and a subcategory. There is a category called "Letter", which
> seems like one reasonable generalization of "alphabetic".
> >
> > In Racket, you can get the code for a character's category using
> `char-general-category`. For example:
> >
> >   > (char-general-category #\A)
> >   'lu
> >   > (char-general-category #\é)
> >   'll
> >   > (char-general-category #\ß)
> >   'll
> >   > (char-general-category #\7)
> >   'nd
> >
> > The general category for "A" is "Letter, uppercase", which has the code
> "Lu", which Racket turns into the symbol 'lu. The general category of "é"
> is "Letter, lowercase", code "Ll", which becomes 'll. The general category
> of "7" is "Number, decimal digit", code "Nd".
> >
> > In Racket regular expressions, the \p{category} syntax lets you
> recognize characters from a specific category. For example, \p{Lu}
> recognizes characters with the category "Letter, uppercase", and \p{L}
> recognizes characters with the category "Letter", which is the union of
> "Letter, uppercase", "Letter, lowercase", and so on.
> >
> > So the regular expression #px"^\\p{L}+$" recognizes sequences of one or
> more Unicode letters. For example:
> >
> >   > (regexp-match? #px"^\\p{L}+$" "héllo")
> >   #t
> >   > (regexp-match? #px"^\\p{L}+$" "straße")
> >   #t
> >   > (regexp-match? #px"^\\p{L}+$" "二の句")
> >   #t
> >   > (regexp-match? #px"^\\p{L}+$" "abc123")
> >   #f ;; No, contains numbers
> >
> > There are still some problems to watch out for, though. For example,
> accented characters like "é" can be expressed as a single pre-composed code
> point or "decomposed" into a base letter and a combining mark. You can get
> the decomposed form by converting the string to "decomposed normal form"
> (NFD), and the regexp above won't match that string.
> >
> >   > (map char-general-category (string->list (string-normalize-nfd "é")))
> >   '(ll mn)
> >   > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
> >   #f
> > 
> > One fix would be to call `string-normalize-nfc` first, but some
> letter-modifier pairs don't have pre-composed versions. Another fix would
> be to expand the regexp to include modifiers. You'd have to decide which is
> better based on your application.
> >
> > Ryan
> >
>
> --
> You received this message because you are subscribed to the Google Groups
> "Racket Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/racket-users/09B244A4-89C5-4B5C-97E7-5487059125F6%40gmail.com
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/CANy33qmWderKJgG%2Bqqw7k__ccZUg-KmX3U4RBfB-SAb4H%2BoNoQ%40mail.gmail.com.

Re: [racket-users] Are Regular Expression classes Unicode aware?

Reply via email to