rust frontend and UTF-8/unicode processing/properties

2021-07-18 Thread Mark Wielaard
Hi,

For the gcc rust frontend I was thinking of importing a couple of
gnulib modules to help with UTF-8 processing, conversion to/from
unicode codepoints and determining various properties of those
codepoints. But it seems gcc doesn't yet have any gnulib modules
imported, and maybe other frontends already have helpers for this that
the gcc rust frontend could reuse.

Rust only accepts valid UTF-8 encoded source files, which may or may
not start with a UTF-8 BOM character. Whitespace is any codepoint with
the Pattern_White_Space property. Identifiers can start with any
codepoint with the XID_Start property, followed by zero or more
codepoints with the XID_Continue property. It isn't required, but it is
highly desirable, to detect confusable identifiers according to
tr39/Confusable_Detection.
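
To make the first two requirements concrete, here is a minimal sketch in
C++11. The names utf8_bom_length and is_pattern_white_space are
hypothetical and not taken from any existing gcc frontend. The
Pattern_White_Space set is small enough to hardcode, while the XID
properties need generated tables and are not covered here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes taken by a leading UTF-8 BOM
// (bytes EF BB BF, the encoding of U+FEFF), or 0 if there is none.
inline std::size_t
utf8_bom_length (const std::string &src)
{
  if (src.size () >= 3
      && static_cast<unsigned char> (src[0]) == 0xEF
      && static_cast<unsigned char> (src[1]) == 0xBB
      && static_cast<unsigned char> (src[2]) == 0xBF)
    return 3;
  return 0;
}

// Hypothetical helper: true for the codepoints that carry the
// Pattern_White_Space property.
inline bool
is_pattern_white_space (char32_t cp)
{
  return (cp >= 0x0009 && cp <= 0x000D)   // TAB, LF, VT, FF, CR
         || cp == 0x0020                  // SPACE
         || cp == 0x0085                  // NEXT LINE
         || cp == 0x200E || cp == 0x200F  // LRM, RLM
         || cp == 0x2028 || cp == 0x2029; // LINE/PARAGRAPH SEPARATOR
}
```

A lexer could skip utf8_bom_length (source) bytes once at the start of a
file and then call is_pattern_white_space on each decoded codepoint.
Note that U+00A0 (NO-BREAK SPACE) is deliberately not in the set.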

Other names might be constrained to the Alphabetic property and/or the
Number categories (Nd, Nl, No). Textual types can only contain Unicode
Scalar Values (any Unicode codepoint except high and low surrogates).
Strings in source code can contain unicode escapes (24 bit, up to 6
hex digits per codepoint) but are internally stored as UTF-8 (and must
not encode any surrogates).
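
As a rough illustration of the scalar value rule and of storing an
escaped codepoint as UTF-8, here is a C++11 sketch. Both function names
are hypothetical; the byte layout is just standard UTF-8.

```cpp
#include <cstdint>
#include <string>

// True if CP is a Unicode Scalar Value: any codepoint up to U+10FFFF
// except the surrogate range U+D800..U+DFFF.
inline bool
is_unicode_scalar_value (std::uint32_t cp)
{
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: append the UTF-8 encoding of CP to OUT.
// Returns false and leaves OUT untouched if CP is not a scalar value,
// so surrogates can never end up in a stored string.
inline bool
encode_utf8 (std::uint32_t cp, std::string &out)
{
  if (!is_unicode_scalar_value (cp))
    return false;

  if (cp < 0x80)
    out += static_cast<char> (cp);
  else if (cp < 0x800)
    {
      out += static_cast<char> (0xC0 | (cp >> 6));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
  else if (cp < 0x10000)
    {
      out += static_cast<char> (0xE0 | (cp >> 12));
      out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
  else
    {
      out += static_cast<char> (0xF0 | (cp >> 18));
      out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
  return true;
}
```

A string literal lexer could parse the hex digits of an escape into a
std::uint32_t and report an error whenever encode_utf8 returns false.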

Do other gcc frontends handle any of the above already in a way that
might be reusable for other frontends?

Thanks,

Mark

-- 
Gcc-rust mailing list
Gcc-rust@gcc.gnu.org
https://gcc.gnu.org/mailman/listinfo/gcc-rust


Re: rust frontend and UTF-8/unicode processing/properties

2021-07-18 Thread Ian Lance Taylor via Gcc-rust
On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard wrote:
>
> For the gcc rust frontend I was thinking of importing a couple of
> gnulib modules to help with UTF-8 processing, conversion to/from
> unicode codepoints and determining various properties of those
> codepoints. But it seems gcc doesn't yet have any gnulib modules
> imported, and maybe other frontends already have helpers for this that
> the gcc rust frontend could reuse.
>
> Rust only accepts valid UTF-8 encoded source files, which may or may
> not start with a UTF-8 BOM character. Whitespace is any codepoint with
> the Pattern_White_Space property. Identifiers can start with any
> codepoint with the XID_Start property, followed by zero or more
> codepoints with the XID_Continue property. It isn't required, but it is
> highly desirable, to detect confusable identifiers according to
> tr39/Confusable_Detection.
>
> Other names might be constrained to the Alphabetic property and/or the
> Number categories (Nd, Nl, No). Textual types can only contain Unicode
> Scalar Values (any Unicode codepoint except high and low surrogates).
> Strings in source code can contain unicode escapes (24 bit, up to 6
> hex digits per codepoint) but are internally stored as UTF-8 (and must
> not encode any surrogates).
>
> Do other gcc frontends handle any of the above already in a way that
> might be reusable for other frontends?

I don't know that this is particularly helpful, but the Go frontend
has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
probably won't be able to use the code directly, and the code in the
gofrontend directory is also shared with GoLLVM so it can't trivially
be moved.
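
For readers who have not looked at that kind of lexer code, the core of
it is a strict decoder for a single UTF-8 sequence. The following C++11
sketch is a hypothetical standalone version written for illustration; it
is not the gofrontend code and decode_utf8_char is an invented name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one codepoint from SRC starting at byte offset POS. On
// success, store it in CP and return the number of bytes consumed
// (1 to 4). Return 0 for invalid UTF-8: truncated or overlong
// sequences, stray continuation bytes, surrogates, and values above
// U+10FFFF. CP is unspecified when 0 is returned.
inline std::size_t
decode_utf8_char (const std::string &src, std::size_t pos, std::uint32_t &cp)
{
  if (pos >= src.size ())
    return 0;

  const unsigned char lead = static_cast<unsigned char> (src[pos]);
  std::size_t len;
  std::uint32_t min_cp; // smallest codepoint that needs LEN bytes
  if (lead < 0x80)
    {
      cp = lead;
      return 1;
    }
  else if ((lead & 0xE0) == 0xC0)
    { len = 2; min_cp = 0x80; cp = lead & 0x1F; }
  else if ((lead & 0xF0) == 0xE0)
    { len = 3; min_cp = 0x800; cp = lead & 0x0F; }
  else if ((lead & 0xF8) == 0xF0)
    { len = 4; min_cp = 0x10000; cp = lead & 0x07; }
  else
    return 0; // continuation byte or invalid lead byte

  if (src.size () - pos < len)
    return 0; // truncated sequence

  for (std::size_t i = 1; i < len; ++i)
    {
      const unsigned char c = static_cast<unsigned char> (src[pos + i]);
      if ((c & 0xC0) != 0x80)
        return 0; // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }

  if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0; // overlong, out of range, or surrogate
  return len;
}
```

Since Rust only accepts valid UTF-8, a lexer built on something like
this would treat a return value of 0 as a hard error rather than
substituting a replacement character.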

Ian


Re: Fwd: New contributor tasks

2021-07-18 Thread Mark Wielaard
Hi Joel,

On Sat, Jul 17, 2021 at 10:25:48PM +0800, The Other wrote:
> >> - I noticed some expressions didn't parse because of what look to me
> >>   like operator precedence issues, e.g. the following:
> >>
> >>   const S: usize = 64;
> >>
> >>   pub fn main ()
> >>   {
> >>     let a:u8 = 1;
> >>     let b:u8 = 2;
> >>     let _c = S * a as usize + b as usize;
> >>   }
> >>
> >>   $ gcc/gccrs -Bgcc as.rs
> >>
> >>   as.rs:7:27: error: type param bounds (in TraitObjectType) are not
> >>   allowed as TypeNoBounds
> >>       7 |   let _c = S * a as usize + b as usize;
> >>         |                           ^
> >>
> >>   How does one fix such operator precedence issues in the parser?
> 
> > Off the top of my head it looks as though the parse_type_cast_expr has a
> > FIXME for the precedence issue for it. The Pratt parser uses the notion
> > of binding powers to handle this and I think it needs to follow in a
> > similar style to the ::parse_expr piece.
> 
> Yes, this is probably a precedence issue. The actual issue is that while
> expressions have precedence, types (such as "usize", which is what is being
> parsed) do not, and greedily parse tokens like "+". Additionally, the
> interactions of types and expressions and precedence between them is
> something that I have no idea how to approach.
> I believe that this specific issue could be fixed by modifying the
> parse_type_no_bounds method - if instead of erroring when finding a plus,
> it simply returned (treating it like an expression would treat a semicolon,
> basically), then this would have the desired functionality. I don't believe
> that parse_type_no_bounds (TypeNoBounds do not have '+' in them) would ever
> be called in an instance where a Type (that allows bounds) is allowable, so
> this change should hopefully not cause any correct programs to parse
> incorrectly.

I think you are correct. The issue is that parse_type_no_bounds tries
to be helpful and greedily looks for a PLUS so it can produce an
error. Simply removing that case makes things parse.
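
To illustrate why returning at the PLUS is enough, here is a toy
binding-power parser in C++11. It is only a sketch under simplifying
assumptions (whitespace-separated tokens, three operators), every name
in it is hypothetical, and it is not the gccrs parser. The type parser
consumes one path and leaves a following "+" for the expression loop.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct ToyParser
{
  std::vector<std::string> toks;
  std::size_t pos;

  explicit ToyParser (const std::string &src) : pos (0)
  {
    std::istringstream in (src);
    std::string t;
    while (in >> t)
      toks.push_back (t);
  }

  // Left binding power: higher binds tighter; 0 means "not an operator".
  static int binding_power (const std::string &t)
  {
    if (t == "as") return 30;
    if (t == "*")  return 20;
    if (t == "+")  return 10;
    return 0;
  }

  // A type without bounds is a single path here. It never looks at a
  // following "+"; that token is left for the expression parser.
  std::string parse_type_no_bounds () { return toks.at (pos++); }

  std::string parse_expr (int min_bp)
  {
    std::string lhs = toks.at (pos++); // operand: identifier or literal
    while (pos < toks.size ())
      {
        const std::string op = toks[pos];
        const int bp = binding_power (op);
        if (bp <= min_bp) // weaker operator, or not an operator at all
          break;
        ++pos;
        const std::string rhs
          = (op == "as") ? parse_type_no_bounds () : parse_expr (bp);
        lhs = "(" + lhs + " " + op + " " + rhs + ")";
      }
    return lhs;
  }
};

// Parse SRC and return a fully parenthesized rendering of the tree.
inline std::string
parse_to_string (const std::string &src)
{
  ToyParser p (src);
  return p.parse_expr (0);
}
```

With this, parse_to_string ("S * a as usize + b as usize") groups the
casts first, then the multiplication, then the addition, which is the
grouping Rust expects for the example above.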

Patch attached and also here:
https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=as-type

This cannot be fully tested yet, because "as" Cast Expressions aren't
lowered from AST to HIR yet. I didn't get very far trying to lower the
CastExpr to HIR. This is what I came up with, but I didn't know how to
handle the type path yet.

diff --git a/gcc/rust/hir/rust-ast-lower-expr.h b/gcc/rust/hir/rust-ast-lower-expr.h
index 19ce8c2cf1f..96f6073cd86 100644
--- a/gcc/rust/hir/rust-ast-lower-expr.h
+++ b/gcc/rust/hir/rust-ast-lower-expr.h
@@ -405,6 +405,24 @@ public:
   expr.get_locus ());
   }
 
+  void visit (AST::TypeCastExpr &expr) override
+  {
+    HIR::Expr *expr_to_cast_to
+      = ASTLoweringExpr::translate (expr.get_casted_expr ().get ());
+
+    HIR::TypeNoBounds *type_to_cast_to
+      = nullptr; /* ... (expr._get_type_to_cast_to ().get ())); */
+
+    auto crate_num = mappings->get_current_crate ();
+    Analysis::NodeMapping mapping (crate_num, expr.get_node_id (),
+                                   mappings->get_next_hir_id (crate_num),
+                                   UNKNOWN_LOCAL_DEFID);
+
+    translated = new HIR::TypeCastExpr (
+      mapping, std::unique_ptr<HIR::Expr> (expr_to_cast_to),
+      std::unique_ptr<HIR::TypeNoBounds> (type_to_cast_to), expr.get_locus ());
+  }
+
   /* Compound assignment expression is compiled away. */
   void visit (AST::CompoundAssignmentExpr &expr) override
   {

It does get us a little bit further into the type checker:

as2.rs:7:12: error: failed to type resolve expression
    7 |   let _c = a as usize + b as usize;
      |            ^
as2.rs:7:25: error: failed to type resolve expression
    7 |   let _c = a as usize + b as usize;


Cheers,

Mark

From 4c92de44cde1bdd8d0fcb8a19adafd529d6c759c Mon Sep 17 00:00:00 2001
From: Mark Wielaard 
Date: Sun, 18 Jul 2021 22:12:20 +0200
Subject: [PATCH] Remove error handling in parse_type_no_bounds for PLUS token

parse_type_no_bounds tries to be helpful and greedily looks for a PLUS
token after having parsed a typepath so it can produce an error. But
that error breaks parsing expressions that contain "as" Cast
Expressions like "a as usize + b as usize". Drop the explicit error on
seeing a PLUS token and just return the type path parsed.
---
 gcc/rust/parse/rust-parse-impl.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/gcc/rust/parse/rust-parse-impl.h b/gcc/rust/parse/rust-parse-impl.h
index eedc76db43e..a0607926950 100644
--- a/gcc/rust/parse/rust-parse-impl.h
+++ b/gcc/rust/parse/rust-parse-impl.h
@@ -9996,13 +9996,6 @@ Parser::parse_type_no_bounds ()
    std::move (tok_tree)),
 		  {}, locus));
 	}
-	  case PLUS:
-	// type param bounds - not allowed, here for error message
-	add_error (Error (t->get_locus (),
-			  "type param bounds (in TraitObjectType) are not "
-			  "allowed as TypeNoBounds"));
-
-	return nullptr;
 	  default:
 	// assume that this is a type path 

Re: rust frontend and UTF-8/unicode processing/properties

2021-07-18 Thread Jason Merrill via Gcc-rust
On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc wrote:

> On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard  wrote:
> >
> > For the gcc rust frontend I was thinking of importing a couple of
> > gnulib modules to help with UTF-8 processing, conversion to/from
> > unicode codepoints and determining various properties of those
> > codepoints. But it seems gcc doesn't yet have any gnulib modules
> > imported, and maybe other frontends already have helpers for this that
> > the gcc rust frontend could reuse.
> >
> > Rust only accepts valid UTF-8 encoded source files, which may or may
> > not start with a UTF-8 BOM character. Whitespace is any codepoint with
> > the Pattern_White_Space property. Identifiers can start with any
> > codepoint with the XID_Start property, followed by zero or more
> > codepoints with the XID_Continue property. It isn't required, but it is
> > highly desirable, to detect confusable identifiers according to
> > tr39/Confusable_Detection.
> >
> > Other names might be constrained to the Alphabetic property and/or the
> > Number categories (Nd, Nl, No). Textual types can only contain Unicode
> > Scalar Values (any Unicode codepoint except high and low surrogates).
> > Strings in source code can contain unicode escapes (24 bit, up to 6
> > hex digits per codepoint) but are internally stored as UTF-8 (and must
> > not encode any surrogates).
> >
> > Do other gcc frontends handle any of the above already in a way that
> > might be reusable for other frontends?
>
> I don't know that this is particularly helpful, but the Go frontend
> has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
> Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
> unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
> probably won't be able to use the code directly, and the code in the
> gofrontend directory is also shared with GoLLVM so it can't trivially
> be moved.
>

I believe the UTF-8 handling for the C family front ends is all in libcpp;
I don't think it's factored in a way to be useful to other front ends.

Jason