https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105959
David Malcolm <dmalcolm at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2023-01-30 00:00:00         |2023-03-16
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED
--- Comment #7 from David Malcolm <dmalcolm at gcc dot gnu.org> ---
Aha! - thanks for the information.
I think GCC is writing out the .sarif file in UTF-8 form regardless of the
environment on everyone's box. The issue seems to be this line in the
testcase, which checks for the UTF-8 in the "snippet" output:
{ dg-final { scan-sarif-file "\"text\": \" int \\u6587\\u5b57\\u5316\\u3051 = " } }
that's failing somewhere within DejaGnu, presumably due to the environment
differences.
There is some variation because json::object uses a hash_map for its key/value
pairs, which means (annoyingly) that it outputs them in arbitrary order,
leading to non-determinism in the .sarif content.
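To illustrate why key order alone makes raw-text comparison fragile, here is a minimal Python sketch. The two fragments are hand-written stand-ins, not taken from a real .sarif file; the point is only that the same object can serialize in different key orders and still compare equal once parsed.

```python
import json

# Two serializations of the same object with keys in a different order,
# as could happen when key/value pairs are emitted from a hash map.
# Both strings are illustrative, not real GCC output.
a = '{"startLine": 7, "snippet": {"text": "  int x = *42;\\n"}}'
b = '{"snippet": {"text": "  int x = *42;\\n"}, "startLine": 7}'

same_text = (a == b)                           # textual comparison
same_value = (json.loads(a) == json.loads(b))  # structural comparison

print(same_text, same_value)
```

A check that matches on a single key/value pair (as the snippet test does) is not affected by this, but anything spanning several keys would be.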
Perhaps it's possible to express byte-level matching in Tcl? I'll have a look.
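For comparison, this is roughly what a byte-level check looks like outside Tcl. It is a minimal Python sketch, not how scan-sarif-file is implemented: `data` is a hypothetical stand-in for the raw file contents (in practice it would come from reading the .sarif file in binary mode), and the expected text is encoded to UTF-8 before searching so that no locale-dependent decoding is involved.

```python
import re

# Hypothetical stand-in for the raw bytes of the .sarif file; in practice
# this would be open(path, "rb").read().
data = (b'{"text": "  int '
        b'\xe6\x96\x87\xe5\xad\x97\xe5\x8c\x96\xe3\x81\x91'
        b' = *42;\\n"}')

# Encode the expected text to UTF-8 once, then search bytes against bytes.
needle = '"text": "  int \u6587\u5b57\u5316\u3051 = '.encode("utf-8")
found = re.search(re.escape(needle), data) is not None

print(found)
```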
Details
=======
The source code (gcc/testsuite/c-c++-common/diagnostic-format-sarif-file-4.c)
is indeed UTF-8 encoded; looking at the output of
./contrib/unicode/utf8-dump.py, I see this for line 7:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
7 | int 文字化け = *42;
| U+0020 0x20 SPACE
(separator)
| U+0020 0x20 SPACE
(separator)
| U+0069 0x69 LATIN SMALL LETTER I i
| U+006E 0x6e LATIN SMALL LETTER N n
| U+0074 0x74 LATIN SMALL LETTER T t
| U+0020 0x20 SPACE
(separator)
| U+6587 0xe6 0x96 0x87 CJK UNIFIED IDEOGRAPH-6587 文
| U+5B57 0xe5 0xad 0x97 CJK UNIFIED IDEOGRAPH-5B57 字
| U+5316 0xe5 0x8c 0x96 CJK UNIFIED IDEOGRAPH-5316 化
| U+3051 0xe3 0x81 0x91 HIRAGANA LETTER KE け
| U+0020 0x20 SPACE
(separator)
| U+003D 0x3d EQUALS SIGN =
| U+0020 0x20 SPACE
(separator)
| U+002A 0x2a ASTERISK *
| U+0034 0x34 DIGIT FOUR 4
| U+0032 0x32 DIGIT TWO 2
| U+003B 0x3b SEMICOLON ;
| U+000A 0x0a LINE FEED (LF)
(control character)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
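The code point to byte mapping for the identifier can be cross-checked independently of utf8-dump.py. The following is a small Python sketch that rebuilds those columns for the four non-ASCII characters; it is not the utf8-dump.py implementation, and the `rows` layout is just chosen for illustration.

```python
import unicodedata

# The identifier from line 7 of the testcase, written with escapes so the
# snippet itself stays ASCII.
identifier = "\u6587\u5b57\u5316\u3051"

rows = []
for ch in identifier:
    # UTF-8 bytes of this single character, formatted like the dump above.
    utf8 = " ".join(f"0x{b:02x}" for b in ch.encode("utf-8"))
    rows.append((f"U+{ord(ch):04X}", utf8, unicodedata.name(ch)))

for row in rows:
    print(*row)
```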
Looking at the output on my box via:
hexdump -C testsuite/gcc/diagnostic-format-sarif-file-4.c.sarif|less
and looking for "snippet" shows:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
000005a0 3a 20 7b 22 63 6f 6e 74 65 78 74 52 65 67 69 6f |: {"contextRegio|
000005b0 6e 22 3a 20 7b 22 73 74 61 72 74 4c 69 6e 65 22 |n": {"startLine"|
000005c0 3a 20 37 2c 20 22 73 6e 69 70 70 65 74 22 3a 20 |: 7, "snippet": |
000005d0 7b 22 74 65 78 74 22 3a 20 22 20 20 69 6e 74 20 |{"text": " int |
000005e0 e6 96 87 e5 ad 97 e5 8c 96 e3 81 91 20 3d 20 2a |............ = *|
000005f0 34 32 3b 5c 6e 22 7d 7d 2c 20 22 61 72 74 69 66 |42;\n"}}, "artif|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
where it's been encoded in UTF-8 as:
e6 96 87 e5 ad 97 e5 8c 96 e3 81 91 20 3d
Running ./contrib/unicode/utf8-dump.py on it confirms that the snippet has
been written in UTF-8 form:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
| U+0069 0x69 LATIN SMALL LETTER I i
| U+006E 0x6e LATIN SMALL LETTER N n
| U+0074 0x74 LATIN SMALL LETTER T t
| U+0020 0x20 SPACE
(separator)
| U+6587 0xe6 0x96 0x87 CJK UNIFIED IDEOGRAPH-6587 文
| U+5B57 0xe5 0xad 0x97 CJK UNIFIED IDEOGRAPH-5B57 字
| U+5316 0xe5 0x8c 0x96 CJK UNIFIED IDEOGRAPH-5316 化
| U+3051 0xe3 0x81 0x91 HIRAGANA LETTER KE け
| U+0020 0x20 SPACE
(separator)
| U+003D 0x3d EQUALS SIGN =
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
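The same confirmation can be done directly from the hexdump bytes. This short Python sketch decodes the 14 bytes quoted above as UTF-8 and lists the resulting code points; a decode failure would raise UnicodeDecodeError.

```python
# Bytes copied from the hexdump of the snippet.
raw = bytes.fromhex("e6 96 87 e5 ad 97 e5 8c 96 e3 81 91 20 3d")

# Strict UTF-8 decoding: invalid sequences would raise UnicodeDecodeError.
text = raw.decode("utf-8")
codepoints = [f"U+{ord(c):04X}" for c in text]

print(codepoints)
```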
The test case has:
{ dg-final { scan-sarif-file "\"text\": \" int \\u6587\\u5b57\\u5316\\u3051 = " } }
which is looking for the text of the snippet containing the Unicode characters.
Attachment 54658 (with md5sum 67cc5fdbee9006509aa38af635d6cf69) has this for
the snippet:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
000005f0 73 6e 69 70 70 65 74 22 3a 20 7b 22 74 65 78 74 |snippet": {"text|
00000600 22 3a 20 22 20 20 69 6e 74 20 e6 96 87 e5 ad 97 |": " int ......|
00000610 e5 8c 96 e3 81 91 20 3d 20 2a 34 32 3b 5c 6e 22 |...... = *42;\n"|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
which is:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
| U+0069 0x69 LATIN SMALL LETTER I i
| U+006E 0x6e LATIN SMALL LETTER N n
| U+0074 0x74 LATIN SMALL LETTER T t
| U+0020 0x20 SPACE
(separator)
| U+6587 0xe6 0x96 0x87 CJK UNIFIED IDEOGRAPH-6587 文
| U+5B57 0xe5 0xad 0x97 CJK UNIFIED IDEOGRAPH-5B57 字
| U+5316 0xe5 0x8c 0x96 CJK UNIFIED IDEOGRAPH-5316 化
| U+3051 0xe3 0x81 0x91 HIRAGANA LETTER KE け
| U+0020 0x20 SPACE
(separator)
| U+003D 0x3d EQUALS SIGN =
| U+0020 0x20 SPACE
(separator)
| U+002A 0x2a ASTERISK *
| U+0034 0x34 DIGIT FOUR 4
| U+0032 0x32 DIGIT TWO 2
| U+003B 0x3b SEMICOLON ;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hence GCC is also writing out the .sarif file in UTF-8 form in that attachment,
regardless of the environment; the issue is presumably within the handling of
this directive:
{ dg-final { scan-sarif-file "\"text\": \" int \\u6587\\u5b57\\u5316\\u3051 = " } }
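One way identical bytes could pass on one machine and fail on another is if the file is decoded with a different encoding before the pattern is applied. That is only an assumption about the failure mode, not something established here. The Python sketch below shows the effect in isolation, using Latin-1 as an arbitrary example of a non-UTF-8 encoding.

```python
# Bytes of the identifier followed by " =", as seen in the hexdumps.
raw = bytes.fromhex("e6 96 87 e5 ad 97 e5 8c 96 e3 81 91 20 3d")

# What the \uXXXX escapes in the directive denote once interpreted.
expected = "\u6587\u5b57\u5316\u3051 ="

# Same bytes, two decodings. Latin-1 is an assumed example of a
# non-UTF-8 environment; it maps every byte to one character.
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")

match_utf8 = expected in as_utf8
match_latin1 = expected in as_latin1

print(match_utf8, match_latin1)
```

If something like this is happening, the .sarif file itself is fine and only the comparison side needs to change.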