
Ian Lance Taylor via Gcc Sun, 18 Jul 2021 13:12:47 -0700

On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard <m...@klomp.org> wrote:
>
> For the gcc rust frontend I was thinking of importing a couple of
> gnulib modules to help with UTF-8 processing, conversion to/from
> unicode codepoints and determining various properties of those
> codepoints. But it seems gcc doesn't yet have any gnulib modules
> imported, and maybe other frontends already have helpers for this that
> the gcc rust frontend could reuse.
>
> Rust only accepts valid UTF-8 encoded source files, which may or may
> not start with a UTF-8 BOM. Whitespace is any codepoint with
> the Pattern_White_Space property. Identifiers start with a
> codepoint with the XID_Start property, followed by zero or more
> codepoints with the XID_Continue property. It isn't required, but it is
> highly desirable to detect confusable identifiers according to
> tr39/Confusable_Detection.
>
> Other names might be constrained to the Alphabetic and/or Number
> categories (Nd, Nl, No). Textual types can only contain Unicode Scalar
> Values (any Unicode codepoint except high and low surrogates).
> Strings in source code can contain unicode escapes (24-bit codepoints,
> up to 6 hex digits) but are internally stored as UTF-8 (and must not
> encode any surrogates).
>
> Do other gcc frontends handle any of the above already in a way that
> might be reusable for other frontends?
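
To make the escape and scalar-value requirements above concrete, here is a minimal C++ sketch. It is not taken from any gcc frontend. The names is_scalar_value, parse_unicode_escape and encode_utf8 are hypothetical, and the parser only looks at the hex digits between the braces of an escape, accepting plain hex digits and nothing else.

```cpp
#include <optional>
#include <string>
#include <string_view>

// True for Unicode scalar values: any code point up to U+10FFFF
// that is not a high or low surrogate.
inline bool is_scalar_value(char32_t cp) {
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: parse the 1 to 6 hex digits of a unicode escape.
// Returns std::nullopt for bad digits, bad length, or a non-scalar value.
inline std::optional<char32_t> parse_unicode_escape(std::string_view digits) {
  if (digits.empty() || digits.size() > 6) return std::nullopt;
  char32_t cp = 0;
  for (char c : digits) {
    unsigned v;
    if (c >= '0' && c <= '9') v = c - '0';
    else if (c >= 'a' && c <= 'f') v = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
    else return std::nullopt;
    cp = cp * 16 + v;
  }
  if (!is_scalar_value(cp)) return std::nullopt;
  return cp;
}

// Encode a scalar value as UTF-8. The caller is assumed to have
// checked is_scalar_value(cp) first.
inline std::string encode_utf8(char32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

Rejecting surrogates at parse time means the stored UTF-8 string can never contain an encoded surrogate, which is the invariant the quoted text asks for.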


I don't know that this is particularly helpful, but the Go frontend
has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
probably won't be able to use the code directly, and the code in the
gofrontend directory is also shared with GoLLVM so it can't trivially
be moved.
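
For illustration only, a standalone C++ sketch of the kind of work such a lexer helper does: decode one UTF-8 character, rejecting truncated, overlong, surrogate and out-of-range sequences, plus skipping an optional BOM. This is not the gofrontend code. The names Decoded, decode_utf8 and skip_bom are made up for the example.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// One decoded character: its code point and the number of bytes consumed.
struct Decoded {
  char32_t codepoint;
  std::size_t length;
};

// Hypothetical helper: drop a leading UTF-8 BOM (EF BB BF) if present.
inline std::string_view skip_bom(std::string_view s) {
  if (s.substr(0, 3) == "\xEF\xBB\xBF") s.remove_prefix(3);
  return s;
}

// Hypothetical helper: decode the UTF-8 sequence at the start of `s`.
// Returns std::nullopt if the input is empty or the sequence is invalid.
inline std::optional<Decoded> decode_utf8(std::string_view s) {
  if (s.empty()) return std::nullopt;
  const auto b0 = static_cast<unsigned char>(s[0]);
  if (b0 < 0x80) return Decoded{b0, 1};  // ASCII fast path

  std::size_t len;
  char32_t cp;
  char32_t min;  // smallest code point that needs `len` bytes
  if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return std::nullopt;  // stray continuation byte or invalid lead byte

  if (s.size() < len) return std::nullopt;  // truncated sequence
  for (std::size_t i = 1; i < len; ++i) {
    const auto b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min) return std::nullopt;                       // overlong encoding
  if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;   // surrogate
  if (cp > 0x10FFFF) return std::nullopt;                  // beyond Unicode
  return Decoded{cp, len};
}
```

A lexer would call decode_utf8 in a loop, advancing by the returned length and reporting an error at the first std::nullopt. Property lookups such as XID_Start would still need tables generated from the Unicode data files, which this sketch does not attempt.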

Ian

Re: rust frontend and UTF-8/unicode processing/properties
