UTF-8 BOM handling

2021-07-05 Thread Mark Wielaard
Hi, A Rust source file can start with a UTF-8 BOM sequence (EF BB BF). This simply indicates that the file is encoded as UTF-8 (all Rust input is interpreted as a sequence of Unicode code points encoded in UTF-8), so it can be skipped before real lexing starts. It isn't necessary to keep track of th

[PATCH 1/2] Handle UTF-8 BOM in lexer

2021-07-05 Thread Mark Wielaard
The very first thing in a Rust source file might be the optional UTF-8 BOM, the 3 bytes 0xEF 0xBB 0xBF. These bytes only mark the file as UTF-8, so they can simply be skipped. Add some testcases to show we now handle such files. --- gcc/rust/lex/rust-lex.cc | 13
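The skipping step described in this patch can be sketched in isolation. This is a minimal illustration, not the code from rust-lex.cc: the helper names `utf8_bom_length` and `skip_utf8_bom` are hypothetical, and the sketch assumes the whole source file is already in memory as a `std::string`, whereas the real lexer may read its input differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not the function added by the patch.
// Returns the number of leading bytes to skip before lexing:
// 3 if the buffer starts with the UTF-8 BOM (EF BB BF), otherwise 0.
std::size_t utf8_bom_length(const std::string &src)
{
  const unsigned char bom[] = {0xEF, 0xBB, 0xBF};
  const std::size_t bom_size = sizeof bom;

  // A buffer shorter than the BOM cannot contain it.
  if (src.size() < bom_size)
    return 0;

  // Compare as unsigned bytes, since plain char may be signed.
  for (std::size_t i = 0; i < bom_size; ++i)
    if (static_cast<unsigned char>(src[i]) != bom[i])
      return 0;

  return bom_size;
}

// Hypothetical helper: returns the text the lexer should tokenize,
// with a leading BOM (if any) removed. Only a BOM at the very start
// of the buffer is skipped; later occurrences are left untouched.
std::string skip_utf8_bom(const std::string &src)
{
  const std::size_t n = utf8_bom_length(src);
  assert(n <= src.size());
  return src.substr(n);
}
```

Returning a length rather than a flag matches the point made in the thread: once the bytes are skipped, nothing downstream needs to know whether a BOM was present.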

[PATCH 2/2] Remove has_utf8bom flag from AST and HIR Crate classes

2021-07-05 Thread Mark Wielaard
The lexer deals with the UTF-8 BOM, so the parser cannot detect whether or not there is a BOM at the start of a file. The flag is therefore neither relevant nor useful in the AST and HIR Crate classes. --- gcc/rust/ast/rust-ast-full-test.cc | 3 --- gcc/rust/ast/rust-ast.h | 11 +++