On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard <m...@klomp.org> wrote:
>
> For the gcc rust frontend I was thinking of importing a couple of
> gnulib modules to help with UTF-8 processing, conversion to/from
> unicode codepoints and determining various properties of those
> codepoints. But it seems gcc doesn't yet have any gnulib modules
> imported, and maybe other frontends already have helpers for this that
> the gcc rust frontend could reuse.
>
> Rust only accepts valid UTF-8 encoded source files, which may or may
> not start with a UTF-8 BOM character. Whitespace is any codepoint with
> the Pattern_White_Space property. Identifiers can start with any
> codepoint with the XID_Start property plus zero or more codepoints with
> the XID_Continue property. It isn't required, but highly desirable to
> detect confusable identifiers according to tr39/Confusable_Detection.
>
> Other names might be constrained to Alphabetic and/or Number categories
> (Nd, Nl, No), textual types can only contain Unicode Scalar Values
> (any Unicode codepoint except high-surrogate and low-surrogates),
> strings in source code can contain unicode escapes (24 bit, up to 6
> digits codepoints) but are internally stored as UTF-8 (and must not
> encode any surrogates).
>
> Do other gcc frontends handle any of the above already in a way that
> might be reusable for other frontends?

I don't know that this is particularly helpful, but the Go frontend has
this kind of code in gcc/go/gofrontend/lex.cc. E.g., Lex::fetch_char,
Lex::advance_one_utf8_char, unicode_space, unicode_digits,
unicode_letters, Lex::is_unicode_space, etc.

But you probably won't be able to use the code directly, and the code in
the gofrontend directory is also shared with GoLLVM, so it can't
trivially be moved.

Ian
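
For illustration only, here is a minimal sketch of the decoding step a
lexer with the requirements quoted above would need: skip an optional
BOM, then decode one UTF-8 sequence at a time, rejecting surrogates and
other invalid input. It is not taken from gcc/go/gofrontend/lex.cc or
from gnulib; decode_utf8 and skip_utf8_bom are hypothetical names, and
property lookups (Pattern_White_Space, XID_Start, XID_Continue) are
left out because they need Unicode data tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of gcc, gofrontend or gnulib.
// Decodes the UTF-8 sequence starting at s[pos]. On success stores the
// code point in *cp and returns the number of bytes consumed (1 to 4).
// Returns 0 on invalid input: truncated or overlong sequences,
// surrogates (U+D800..U+DFFF) and values above U+10FFFF.
inline std::size_t
decode_utf8 (const std::string &s, std::size_t pos, std::uint32_t *cp)
{
  if (pos >= s.size ())
    return 0;

  const unsigned char b0 = static_cast<unsigned char> (s[pos]);
  std::size_t len;
  std::uint32_t value;
  std::uint32_t min; // smallest code point allowed for this length

  if (b0 < 0x80)
    { len = 1; value = b0; min = 0; }
  else if ((b0 & 0xE0) == 0xC0)
    { len = 2; value = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0)
    { len = 3; value = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0)
    { len = 4; value = b0 & 0x07; min = 0x10000; }
  else
    return 0; // stray continuation byte or invalid lead byte

  if (s.size () - pos < len)
    return 0; // sequence runs past the end of the input

  for (std::size_t i = 1; i < len; ++i)
    {
      const unsigned char b = static_cast<unsigned char> (s[pos + i]);
      if ((b & 0xC0) != 0x80)
        return 0; // not a continuation byte
      value = (value << 6) | (b & 0x3F);
    }

  if (value < min)
    return 0; // overlong encoding
  if (value >= 0xD800 && value <= 0xDFFF)
    return 0; // surrogates are not Unicode scalar values
  if (value > 0x10FFFF)
    return 0; // beyond the Unicode range

  *cp = value;
  return len;
}

// Hypothetical helper: returns 3 if s starts with the UTF-8 BOM
// (EF BB BF), otherwise 0, so the caller knows where lexing begins.
inline std::size_t
skip_utf8_bom (const std::string &s)
{
  return s.compare (0, 3, "\xEF\xBB\xBF") == 0 ? 3 : 0;
}
```

A lexer would start at skip_utf8_bom (source) and call decode_utf8
repeatedly, treating a return value of 0 as a hard error, since invalid
UTF-8 is not accepted in Rust source files.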