Re: rust frontend and UTF-8/unicode processing/properties

2021-07-18 Jason Merrill via Gcc-rust
On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc wrote:

> On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard wrote:
> >
> > For the gcc rust frontend I was thinking of importing a couple of
> > gnulib modules to help with UTF-8 processing, conversion to/from
> > unicode codepoints and determining various properties of those
> > codepoints. But it seems gcc doesn't yet have any gnulib modules
> > imported, and maybe other frontends already have helpers for this that
> > the gcc rust frontend could reuse.
> >
> > Rust only accepts valid UTF-8 encoded source files, which may or may
> > not start with a UTF-8 BOM character. Whitespace is any codepoint with
> > the Pattern_White_Space property. Identifiers start with any
> > codepoint with the XID_Start property, followed by zero or more
> > codepoints with the XID_Continue property. It isn't required, but it
> > is highly desirable to detect confusable identifiers according to
> > tr39/Confusable_Detection.
> >
> > Other names might be constrained to the Alphabetic and/or Number
> > categories (Nd, Nl, No). Textual types can only contain Unicode Scalar
> > Values (any Unicode codepoint except high and low surrogates).
> > Strings in source code can contain unicode escapes (24 bit, up to 6
> > hex digits per codepoint) but are internally stored as UTF-8 (and must
> > not encode any surrogates).
> >
> > Do other gcc frontends handle any of the above already in a way that
> > might be reusable for other frontends?
>
> I don't know that this is particularly helpful, but the Go frontend
> has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
> Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
> unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
> probably won't be able to use the code directly, and the code in the
> gofrontend directory is also shared with GoLLVM so it can't trivially
> be moved.
>
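
For reference, the byte-level rules quoted above (valid UTF-8 only, an
optional BOM, scalar values only, escapes stored as UTF-8) can be sketched
in standalone C++17. This is only an illustration and not code from any
GCC frontend; the names decode_utf8, encode_utf8, skip_bom and
is_scalar_value are hypothetical. Property lookups such as XID_Start,
XID_Continue or Pattern_White_Space need Unicode data tables and are not
covered here.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// A Unicode scalar value is any code point up to U+10FFFF that is not a
// surrogate (U+D800..U+DFFF).
bool is_scalar_value(std::uint32_t cp) {
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

struct Decoded {
  std::uint32_t code_point;
  std::size_t length;  // number of bytes consumed
};

// Decode one UTF-8 sequence at the start of `s`.  Returns std::nullopt for
// empty, truncated, overlong, surrogate, or out-of-range input.
std::optional<Decoded> decode_utf8(std::string_view s) {
  if (s.empty()) return std::nullopt;
  const auto b0 = static_cast<unsigned char>(s[0]);
  std::size_t len;
  std::uint32_t cp;
  std::uint32_t min;  // smallest code point allowed for this length
  if (b0 < 0x80) { len = 1; cp = b0; min = 0; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return std::nullopt;  // stray continuation or invalid lead byte
  if (s.size() < len) return std::nullopt;
  for (std::size_t i = 1; i < len; ++i) {
    const auto b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min || !is_scalar_value(cp)) return std::nullopt;
  return Decoded{cp, len};
}

// Skip an optional UTF-8 byte order mark (EF BB BF) at the start of a file.
std::string_view skip_bom(std::string_view s) {
  if (s.substr(0, 3) == "\xEF\xBB\xBF") s.remove_prefix(3);
  return s;
}

// Encode a scalar value as UTF-8, e.g. the value of a \u{...} escape.
// Surrogates and values above U+10FFFF are rejected.
std::optional<std::string> encode_utf8(std::uint32_t cp) {
  if (!is_scalar_value(cp)) return std::nullopt;
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

A lexer built on this would call skip_bom once per file and then
decode_utf8 in a loop, reporting an error on the first std::nullopt.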

I believe the UTF-8 handling for the C family front ends is all in libcpp;
I don't think it's factored in a way to be useful to other front ends.

Jason
-- 
Gcc-rust mailing list
Gcc-rust@gcc.gnu.org
https://gcc.gnu.org/mailman/listinfo/gcc-rust


Re: [PATCH Rust front-end v2 01/37] Use DW_ATE_UTF for the Rust 'char' type

2022-08-24 Jason Merrill via Gcc-rust

On 8/24/22 04:59, herron.phi...@googlemail.com wrote:

From: Tom Tromey 

The Rust 'char' type should use the DWARF DW_ATE_UTF encoding.


The DWARF changes are OK.


---
  gcc/dwarf2out.cc | 23 ++-
  1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/gcc/dwarf2out.cc b/gcc/dwarf2out.cc
index e3920c898f5..a8bccbabca4 100644
--- a/gcc/dwarf2out.cc
+++ b/gcc/dwarf2out.cc
@@ -5600,6 +5600,16 @@ is_fortran (const_tree decl)
   return is_fortran ();
 }
 
+/* Return TRUE if the language is Rust.  */
+
+static inline bool
+is_rust ()
+{
+  unsigned int lang = get_AT_unsigned (comp_unit_die (), DW_AT_language);
+
+  return lang == DW_LANG_Rust || lang == DW_LANG_Rust_old;
+}
+
 /* Return TRUE if the language is Ada.  */
 
 static inline bool
@@ -13231,7 +13241,11 @@ base_type_die (tree type, bool reverse)
         }
       if (TYPE_STRING_FLAG (type))
         {
-          if (TYPE_UNSIGNED (type))
+          if ((dwarf_version >= 4 || !dwarf_strict)
+              && is_rust ()
+              && int_size_in_bytes (type) == 4)
+            encoding = DW_ATE_UTF;
+          else if (TYPE_UNSIGNED (type))
             encoding = DW_ATE_unsigned_char;
           else
             encoding = DW_ATE_signed_char;
@@ -25201,6 +25215,13 @@ gen_compile_unit_die (const char *filename)
     }
   else if (strcmp (language_string, "GNU F77") == 0)
     language = DW_LANG_Fortran77;
+  else if (strcmp (language_string, "GNU Rust") == 0)
+    {
+      if (dwarf_version >= 5 || !dwarf_strict)
+        language = DW_LANG_Rust;
+      else
+        language = DW_LANG_Rust_old;
+    }
   else if (dwarf_version >= 3 || !dwarf_strict)
     {
       if (strcmp (language_string, "GNU Ada") == 0)
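
The two decisions the patch adds can be modelled outside of GCC as a
small standalone sketch. It is not the dwarf2out.cc code: DwarfOptions,
pick_char_encoding and pick_rust_language are hypothetical names, and the
GCC globals dwarf_version and dwarf_strict are passed in as a struct. The
numeric constants are the values assigned to these names by the DWARF
standard, with DW_LANG_Rust_old being the pre-DWARF 5 vendor code.

```cpp
#include <cstdint>

// DWARF base type encodings (DW_ATE_UTF was introduced in DWARF 4).
constexpr std::uint32_t DW_ATE_signed_char = 0x06;
constexpr std::uint32_t DW_ATE_unsigned_char = 0x08;
constexpr std::uint32_t DW_ATE_UTF = 0x10;

// DWARF language codes (DW_LANG_Rust was introduced in DWARF 5).
constexpr std::uint32_t DW_LANG_Rust = 0x001c;
constexpr std::uint32_t DW_LANG_Rust_old = 0x9000;

// Hypothetical stand-in for GCC's dwarf_version and dwarf_strict globals.
struct DwarfOptions {
  int version;  // requested DWARF version, e.g. from -gdwarf-N
  bool strict;  // true when only constructs of that version are allowed
};

// Encoding for a character-like integer type: a 4-byte type in a Rust
// compilation unit gets DW_ATE_UTF when DWARF 4 constructs may be emitted;
// everything else falls back to the signed/unsigned char encodings.
std::uint32_t pick_char_encoding(const DwarfOptions &opt, bool lang_is_rust,
                                 int size_in_bytes, bool is_unsigned) {
  if ((opt.version >= 4 || !opt.strict) && lang_is_rust && size_in_bytes == 4)
    return DW_ATE_UTF;
  return is_unsigned ? DW_ATE_unsigned_char : DW_ATE_signed_char;
}

// Language code for a Rust compilation unit: the standard code when
// DWARF 5 constructs may be emitted, otherwise the older vendor code.
std::uint32_t pick_rust_language(const DwarfOptions &opt) {
  return (opt.version >= 5 || !opt.strict) ? DW_LANG_Rust : DW_LANG_Rust_old;
}
```

In this model, strict DWARF 3 output for a Rust 'char' still yields
DW_ATE_unsigned_char, and strict DWARF 4 output uses DW_LANG_Rust_old
together with DW_ATE_UTF.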

