Fwd: New contributor tasks

2021-07-17 Thread The Other via Gcc-rust
Sorry, pressed the wrong button. I meant to "reply all".

---------- Forwarded message ---------
From: The Other 
Date: Sat, Jul 17, 2021 at 10:20 PM
Subject: Re: New contributor tasks
To: Philip Herron 


> The AST dump (--rust-dump-parse) was actually useful for checking the
> comment doc strings, but it could certainly be improved. Ideally it
> would be structured in a way that can easily be used in tests.

Yes, I agree. Its style is inconsistent because I originally intended it
to be basically "to_string" in the most literal sense possible, but then
realised this would be infeasible for some of the more complicated parts.
Ideally, I would personally like it to be in a format similar to clang's
AST dump.

> - Full unicode/utf8 support in the lexer. Currently the lexer only
>   explicitly interprets the input as UTF-8 for string parsing. It
>   should really treat all input as UTF-8. gnulib has some handy
>   modules we could use to read/convert from/to utf8 (unistr/u8-to-u32,
>   unistr/u32-to-u8) and test various unicode properties
>   (unictype/property-white-space, unictype/property-xid-continue,
>   unictype/property-xid-start). I don't know if we can import those or
>   if gcc already has these kind of UTF-8/unicode support functions for
>   other languages?

At the time of writing the lexer, I was under the impression that Rust only
supported UTF-8 in strings. The Rust Reference seems to have changed now to
show that it supports UTF-8 in identifiers as well. I believe that the C++
frontend, at least, has its own specific hardcoded UTF-8 handling for
identifiers and strings (rather than using a library).

There could be issues with lookahead of several bytes (which the lexer uses
liberally) if using UTF-8 in strings, depending on the exact implementation
of whatever library you use (or function you write).

>> - Error handling using rich locations in the lexer and parser.  It
>>   seems some support is already there, but it isn't totally clear to
>>   me what is already in place and what could/should be added. e.g. how
>>   to add notes to an Error.
> I've made a wrapper over RichLocation; I had some crashes when I added
> methods for annotations. Overall my understanding is that the Location
> we have at the moment is a single-character location in the source
> code, but rustc uses Spans, which might be an abstraction we could
> think about implementing instead of the Location wrapper we are
> reusing from GCCGO.

The Error class may need to be redesigned. It was a quick fix I made to
allow parse errors to be ignored (since macro expansion would cause parse
errors with non-matching macro matchers). Instead of having the
"emit_error" and "emit_fatal_error" methods, it may be better to store a
"kind" of error upon construction, and then just have an "emit" method
that emits the specified kind of error.
Similarly, Error may have to be rewritten to use RichLocation instead of
Location.
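
A minimal sketch of that idea (everything below is illustrative - the
names and the Location stand-in are not the existing gccrs API):

  #include <cstdio>
  #include <string>

  using Location = unsigned int;   // stand-in for GCC's real Location type

  enum class ErrorKind { Error, FatalError };

  struct Error
  {
    ErrorKind kind;
    Location locus;
    std::string message;

    Error (ErrorKind kind, Location locus, std::string message)
      : kind (kind), locus (locus), message (std::move (message)) {}

    // A single emit method instead of emit_error/emit_fatal_error; callers
    // can construct Errors speculatively (e.g. while trying macro matchers)
    // and only emit the ones that turn out to be real.
    void emit () const
    {
      std::fprintf (stderr, "%s at %u: %s\n",
                    kind == ErrorKind::FatalError ? "fatal error" : "error",
                    locus, message.c_str ());
      // In the frontend this would call the usual diagnostic routines
      // (ideally taking a RichLocation/span) instead of fprintf.
    }
  };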

>> - I noticed some expressions didn't parse because of what look to me
>>   like operator precedence issues. e.g. the following:
>>
>>   const S: usize = 64;
>>
>>   pub fn main ()
>>   {
>>     let a:u8 = 1;
>>     let b:u8 = 2;
>>     let _c = S * a as usize + b as usize;
>>   }
>>
>>   $ gcc/gccrs -Bgcc as.rs
>>
>>   as.rs:7:27: error: type param bounds (in TraitObjectType) are not allowed as TypeNoBounds
>> 7 |   let _c = S * a as usize + b as usize;
>>   |   ^
>>
>>   How does one fix such operator precedence issues in the parser?

> Off the top of my head it looks as though parse_type_cast_expr has a
> FIXME for the precedence issue. The Pratt parser uses the notion
> of binding powers to handle this, and I think it needs to follow a
> similar style to the ::parse_expr piece.

Yes, this is probably a precedence issue. The actual issue is that while
expressions have precedence, types (such as "usize", which is what is
being parsed here) do not, and greedily parse tokens like "+".
Additionally, the interaction of types and expressions, and the
precedence between them, is something that I have no idea how to
approach.
I believe that this specific issue could be fixed by modifying the
parse_type_no_bounds method - if, instead of erroring when finding a
plus, it simply returned (treating it like an expression would treat a
semicolon, basically), then this would have the desired behaviour. I
don't believe that parse_type_no_bounds (TypeNoBounds do not have '+' in
them) would ever be called in an instance where a Type (that allows
bounds) is allowable, so this change should hopefully not cause any
correct programs to parse incorrectly.
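
For reference, a minimal sketch of the binding-power idea that a Pratt
parser uses - the token set and binding-power values below are made up
for illustration and are not the ones in the gccrs parser:

  #include <cstdio>
  #include <string>
  #include <vector>

  // Illustrative token stream for: S * a as usize + b as usize
  enum class Tok { Ident, Star, Plus, As, End };

  struct Token { Tok kind; std::string text; };

  // Left binding powers; "as" binds tighter than '*', which binds tighter
  // than '+', so a cast only consumes its own operand, not the whole sum.
  static int left_binding_power (Tok t)
  {
    switch (t)
      {
      case Tok::As:   return 30;
      case Tok::Star: return 20;
      case Tok::Plus: return 10;
      default:        return 0;
      }
  }

  struct Parser
  {
    std::vector<Token> toks;
    size_t pos = 0;

    const Token &peek () const { return toks[pos]; }
    Token next () { return toks[pos++]; }

    // Core Pratt loop: keep extending the left-hand side while the next
    // operator binds more tightly than the caller's binding power.
    std::string parse_expr (int right_binding_power)
    {
      std::string lhs = next ().text;   // only identifiers in this sketch
      while (right_binding_power < left_binding_power (peek ().kind))
        {
          Token op = next ();
          if (op.kind == Tok::As)
            // The cast target is a type, not an expression; a real parser
            // would call something like parse_type_no_bounds here, which
            // must stop at '+' rather than erroring.
            lhs = "(" + lhs + " as " + next ().text + ")";
          else
            lhs = "(" + lhs + " " + op.text + " "
                  + parse_expr (left_binding_power (op.kind)) + ")";
        }
      return lhs;
    }
  };

  int main ()
  {
    Parser p;
    p.toks = { {Tok::Ident, "S"}, {Tok::Star, "*"}, {Tok::Ident, "a"},
               {Tok::As, "as"},   {Tok::Ident, "usize"},
               {Tok::Plus, "+"},  {Tok::Ident, "b"},
               {Tok::As, "as"},   {Tok::Ident, "usize"},
               {Tok::End, ""} };
    // Prints ((S * (a as usize)) + (b as usize))
    std::printf ("%s\n", p.parse_expr (0).c_str ());
    return 0;
  }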

>> - rust-macro-expand tries to handle both macros and attributes, is
>>  this by design?  Should we handle different passes for different
>>  (inert or not) attributes that run before or after macro expansion?
> As for macro and cfg expansion, Joel, some stuff is already in place,
> but I do think they need to be separated into distinct passes, which
> would be a good first start with the expand folder.

Re: Fwd: New contributor tasks

2021-07-17 Thread Mark Wielaard
Hi Joel,

On Sat, Jul 17, 2021 at 10:25:48PM +0800, The Other wrote:
> > - Full unicode/utf8 support in the lexer. Currently the lexer only
> >   explicitly interprets the input as UTF-8 for string parsing. It
> >   should really treat all input as UTF-8. gnulib has some handy
> >   modules we could use to read/convert from/to utf8 (unistr/u8-to-u32,
> >   unistr/u32-to-u8) and test various unicode properties
> >   (unictype/property-white-space, unictype/property-xid-continue,
> >   unictype/property-xid-start). I don't know if we can import those or
> >   if gcc already has these kind of UTF-8/unicode support functions for
> >   other languages?
> 
> At the time of writing the lexer, I was under the impression that Rust only
> supported UTF-8 in strings. The Rust Reference seems to have changed now to
> show that it supports UTF-8 in identifiers as well. I believe that the C++
> frontend, at least, has its own specific hardcoded UTF-8 handling for
> identifiers and strings (rather than using a library).
> 
> There could be issues with lookahead of several bytes (which the lexer uses
> liberally) if using UTF-8 in strings, depending on the exact implementation
> of whatever library you use (or function you write).

The whole source file should be valid UTF-8. You can use it in
comments too. And any invalid UTF-8 encoding means the file isn't a
valid Rust source file. So the simplest approach is to make the lexer
handle UTF-8 and work on one codepoint (UCS-4/32 bits) at a time.
Lookahead then also simply works per codepoint. We would still store
strings as UTF-8. gnulib contains various helpers to convert to/from
UTF-8/UCS-4 and to test various unicode properties of codepoints. I'll
ask on the gcc mailing list whether to use the C++ frontend support or
import the gnulib helpers.
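
For illustration, a minimal hand-rolled sketch of that approach - decode
the input into 32-bit codepoints up front so that lookahead works per
codepoint. In practice gnulib's unistr/u8-to-u32 module (or whatever the
C++ frontend already uses) would replace the decoding loop:

  #include <cstdint>
  #include <cstdio>
  #include <string>
  #include <vector>

  // Decode a UTF-8 byte string into UCS-4 codepoints. Returns false on an
  // invalid encoding, which for Rust means the input is not a valid source
  // file at all. (Overlong forms and surrogates are not rejected here; a
  // real implementation - or gnulib - would check those too.)
  static bool
  utf8_to_codepoints (const std::string &input, std::vector<uint32_t> &out)
  {
    size_t i = 0;
    while (i < input.size ())
      {
        unsigned char b = input[i];
        uint32_t cp;
        size_t len;
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else return false;
        if (i + len > input.size ())
          return false;
        for (size_t j = 1; j < len; j++)
          {
            unsigned char c = input[i + j];
            if ((c & 0xC0) != 0x80)
              return false;
            cp = (cp << 6) | (c & 0x3F);
          }
        out.push_back (cp);
        i += len;
      }
    return true;
  }

  int main ()
  {
    std::vector<uint32_t> cps;
    // 6 bytes of UTF-8 but 5 codepoints ("h\xc3\xa9llo" is "héllo");
    // lookahead over cps now moves one character at a time regardless of
    // how many bytes each character needs.
    if (utf8_to_codepoints ("h\xc3\xa9llo", cps))
      std::printf ("%zu codepoints\n", cps.size ());
    return 0;
  }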

> >> - rust-macro-expand tries to handle both macros and attributes, is
> >>  this by design?  Should we handle different passes for different
> >>  (inert or not) attributes that run before or after macro expansion?
> > As for macro and cfg expansion, Joel, some stuff is already in place,
> > but I do think they need to be separated into distinct passes, which
> > would be a good first start with the expand folder.
> 
> That is a good question. Technically, rust-macro-expand only handles cfg
> expansion at the moment. You can read and discuss more about that here:
> https://github.com/Rust-GCC/gccrs/issues/563

I have to think about whether it makes sense to handle the cfg
attribute and the cfg! macro rules in the same pass/expansion. The
cfg! macro seems so simple it could be handled immediately by the
parser, since it only relies on the compiler/host attributes and simply
generates a true or false token.
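
A rough sketch of that idea (TargetConfig, Token, and the function name
below are placeholders, not existing gccrs types): when the parser sees
a simple cfg!(option) invocation, it can look the option up in the set
of active configuration values and substitute a boolean literal token on
the spot.

  #include <cstdio>
  #include <set>
  #include <string>

  // Placeholder for the set of --cfg/target options active for this build.
  struct TargetConfig
  {
    std::set<std::string> enabled;
    bool is_set (const std::string &name) const
    { return enabled.count (name) != 0; }
  };

  // Placeholder token type; the real lexer/parser has its own.
  struct Token
  {
    enum Kind { True, False } kind;
  };

  // Hypothetical immediate handling of cfg!(option): the answer depends
  // only on the host/target configuration, so the parser can replace the
  // whole macro call with a boolean literal token right away, without
  // going through the general macro-expansion pass.
  static Token
  expand_cfg_macro (const TargetConfig &cfg, const std::string &option)
  {
    return Token { cfg.is_set (option) ? Token::True : Token::False };
  }

  int main ()
  {
    TargetConfig cfg;
    cfg.enabled = { "unix" };
    Token t = expand_cfg_macro (cfg, "unix");   // cfg!(unix) -> true
    std::printf ("%s\n", t.kind == Token::True ? "true" : "false");
    return 0;
  }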

In general it seems attribute expansion cannot simply be done by one
AttributeVisitor pass, because the effect can be at different stages of
compilation (and attributes can even affect what the lexer accepts -
e.g. whether identifiers as unicode strings are accepted). For example,
the various lint attributes can warn/error/etc. when lowering the final
AST (CamelCaseStructs, for example), after type checking, or after
liveness analysis. So maybe we need to design a pass for each different
attribute and not try to combine them (except maybe to recognize and
validate the attribute syntax).
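
A rough sketch of what per-attribute passes might look like - all of the
names below are hypothetical, not existing gccrs classes. Each attribute
registers the compilation stage at which it wants to run, and the driver
invokes only the passes for the current stage:

  #include <cstdio>
  #include <memory>
  #include <string>
  #include <vector>

  // Hypothetical compilation stages at which an attribute can take effect.
  enum class Stage { Lexing, Expansion, AstLowering, TypeChecking, Liveness };

  // One pass per attribute, instead of a single AttributeVisitor that must
  // know about every attribute at every stage.
  struct AttributePass
  {
    virtual ~AttributePass () = default;
    virtual std::string name () const = 0;
    virtual Stage stage () const = 0;
    virtual void run () = 0;   // would receive the AST/HIR in practice
  };

  // Example: a non-camel-case-types style lint only makes sense once the
  // AST has been lowered and item names are known.
  struct NonCamelCaseLint final : AttributePass
  {
    std::string name () const override { return "non_camel_case_types"; }
    Stage stage () const override { return Stage::AstLowering; }
    void run () override { std::puts ("checking struct names..."); }
  };

  static void
  run_passes_for_stage (const std::vector<std::unique_ptr<AttributePass>> &passes,
                        Stage current)
  {
    for (const auto &p : passes)
      if (p->stage () == current)
        p->run ();
  }

  int main ()
  {
    std::vector<std::unique_ptr<AttributePass>> passes;
    passes.push_back (std::make_unique<NonCamelCaseLint> ());
    run_passes_for_stage (passes, Stage::AstLowering);
    return 0;
  }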

Cheers,

Mark
-- 
Gcc-rust mailing list
Gcc-rust@gcc.gnu.org
https://gcc.gnu.org/mailman/listinfo/gcc-rust