[clang] [clang-tools-extra] Reland [clang][Sema, Lex, Parse] Preprocessor embed in C and C++ (PR #95802)

2024-08-15 Thread Jakub Jelínek via cfe-commits

jakubjelinek wrote:

@ThePhD @AaronBallman @cor3ntin
Joseph Myers raised an interesting question whether the tokens in 
embed-parameter-sequence are macro expanded or not.
Consider
```c
#define FILE "/etc/passwd"
#define LIMIT limit(1)
#define THIS , 1, 2, 3
#define PRE prefix (42,
ONE
#embed FILE LIMIT suffix(THIS) PRE )
TWO
#embed "/etc/passwd" LIMIT suffix(THIS) PRE )
THREE
#define limit prefix
#embed "/etc/passwd" limit (4) suffix (THIS)
```

The first #embed is I hope clear in that it is the #embed pp-tokens new-line 
case where everything gets macro expanded.
The second case is less clear, the filename part matches the " q-char-sequence 
" case, but LIMIT suffix(THIS) PRE ) is not valid embed-parameter-sequence, so 
shouldn't it be expanded too?
And the last case by strict reading shouldn't be macro expanded because it is 
valid embed-parameter-sequence, still
both the the.phd branch and clang trunk and also my GCC patchset handle it as 
prefix (4) suffix(, 1, 2, 3).
Would in that reading #embed "/etc/passwd" LIMIT suffix(THIS) be valid too and 
thus not macro expanded and thus later invalid?
And, if embed-parameter-sequence tokens are macro expanded only sometimes, what 
should happen say with
#define ARG 2) if_empty (1
#embed "file" limit (ARG)
?
On
```c
#define A "/etc/passwd" limit (1) ) + (0
#if __has_embed (A)
int i;
#endif
```
also all 3 compilers agree and happily use the closing ) from the macro to use 
the ) from the macro expansion for the closing ) of __has_embed and continue 
through to the rest of the expression.  Or is __has_embed supposed to be always 
macro expanded?

https://github.com/llvm/llvm-project/pull/95802
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [clang-tools-extra] Reland [clang][Sema, Lex, Parse] Preprocessor embed in C and C++ (PR #95802)

2024-06-19 Thread Jakub Jelínek via cfe-commits


@@ -0,0 +1,98 @@
+// RUN: %clang_cc1 %s -fsyntax-only --embed-dir=%S/Inputs -verify=expected,cxx 
-Wno-c23-extensions
+// RUN: %clang_cc1 -x c -std=c23 %s -fsyntax-only --embed-dir=%S/Inputs 
-verify=expected,c
+#embed 
+;
+
+void f (unsigned char x) { (void)x;}
+void g () {}
+void h (unsigned char x, int y) {(void)x; (void)y;}
+int i () {
+   return
+#embed 
+   ;
+}
+
+_Static_assert(
+#embed  suffix(,)
+""
+);
+_Static_assert(
+#embed 
+, ""
+);
+_Static_assert(sizeof(
+#embed 
+) ==
+sizeof(unsigned char)
+, ""
+);
+_Static_assert(sizeof
+#embed 
+, ""
+);
+_Static_assert(sizeof(
+#embed  // expected-warning {{left operand of comma operator has no 
effect}}
+) ==
+sizeof(unsigned char)
+, ""
+);
+
+#ifdef __cplusplus
+template 
+void j() {
+   static_assert(First == 'j', "");
+   static_assert(Second == 'k', "");
+}
+#endif
+
+void do_stuff() {
+   f(
+#embed 
+   );
+   g(
+#embed 
+   );
+   h(
+#embed 
+   );
+   int r = i();
+   (void)r;
+#ifdef __cplusplus
+   j<
+#embed 
+   >(
+#embed 
+   );
+#endif
+}
+
+// Ensure that we don't accidentally allow you to initialize an unsigned char *
+// from embedded data; the data is modeled as a string literal internally, but
+// is not actually a string literal.
+const unsigned char *ptr =
+#embed  // expected-warning {{left operand of comma operator has no 
effect}}
+; // c-error@-2 {{incompatible integer to pointer conversion initializing 
'const unsigned char *' with an expression of type 'unsigned char'}} \
+ cxx-error@-2 {{cannot initialize a variable of type 'const unsigned char 
*' with an rvalue of type 'unsigned char'}}
+
+// However, there are some cases where this is fine and should work.
+const unsigned char *null_ptr_1 =
+#embed  if_empty(0)
+;
+
+const unsigned char *null_ptr_2 =
+#embed 
+;
+
+const unsigned char *null_ptr_3 = {
+#embed 
+};
+
+#define FILE_NAME 
+#define LIMIT 1
+#define OFFSET 0
+#define EMPTY_SUFFIX suffix()
+
+constexpr unsigned char ch =
+#embed FILE_NAME limit(LIMIT) clang::offset(OFFSET) EMPTY_SUFFIX
+;
+static_assert(ch == 0);

jakubjelinek wrote:

Unless you want to special case it in way too many spots, I'd think it would be 
far easier to optimize just the inner part of the integer sequence, i.e. 
everything except the first and last sequence element (maybe with the exception 
when the last prefix token is , or first suffix token is ,
Because one can use arbitrary tokens before and after the #embed, it can be
```c
const unsigned char a[] = {
-400 + 4 * 
#embed __FILE__
- 27 };
```
(or with tokens from prefix/suffix) and at least the current patchset 
mishandles many of such cases.  For the inner part of the sequence you know 
there is , before it and , after it, which simplifies a lot of things.
The above is handled correctly by GCC and by clang -save-temps, but not by 
clang without -save-temps.
And there are tons of other cases like that, e.g. even designated initializer 
[26] = 
before the sequence, etc.

https://github.com/llvm/llvm-project/pull/95802
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [clang] Inject tokens containing #embed back into token stream (PR #97274)

2024-07-16 Thread Jakub Jelínek via cfe-commits

jakubjelinek wrote:

@ThePhd @AaronBallman
And even more importantly (checking on godbolt again):
```c
int a = sizeof (
#embed __FILE__ limit (1)
);
```
is 1 in clang as well as clang++ trunk and 4 with clang/clang++ trunk with 
-save-temps.
I thought there was agreement that at least for C the literals have type int, 
for C++ it is fuzzy and depends on what will be voted in but in any case, there 
shouldn't be different code produced between integrated preprocessing and 
-save-temps.
If C++ requires some cast, it would need to make it clear what that exact cast 
is, i.e. whether it is supposed to be
```c
12,143,12,16
```
or
```c
static_cast(12),static_cast(143),static_cast(12),static_cast(16)
```
or
```c
(unsigned char)12,(unsigned char)143,(unsigned char)12,(unsigned char)16
```
or
```c
'\014','\0217','\014','\020'
```
or whatever else and the preprocessor would need to emit that cast on every 
literal (unless using some extension like #embed "." __gnu__::__base64__("...") 
that the GCC patchset uses for the inner parts of the longer sequences.
I think the exact type of cast can affect parsing of some of the expressions, 
e.g.
```c
extern long long p[];
auto foo () {
return
#embed __FILE__ limit (1)
[p];
}
```
will behave one way with the cast is (unsigned char)12 and differently if it is 
static_cast(12) or ((unsigned char)12).
Most of the other bugs I'm seeing are also about consistency, with -save-temps 
it works IMHO correctly while without it misbehaves.  The behavior has to be 
the same.

https://github.com/llvm/llvm-project/pull/97274
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [compiler-rt] [ubsan] Display correct runtime messages for negative _BitInt (PR #96240)

2024-09-02 Thread Jakub Jelínek via cfe-commits


@@ -125,6 +137,25 @@ class TypeDescriptor {
 return 1 << (TypeInfo >> 1);
   }
 
+  const char *getBitIntBitCountPointer() const {
+DCHECK(isBitIntTy());
+DCHECK(isSignedBitIntTy());
+// Scan Name for zero and return the next address
+const char *p = getTypeName();
+while (*p != '\0')
+  ++p;
+// Return the next address
+return p + 1;
+  }
+
+  unsigned getIntegerBitCount() const {
+DCHECK(isIntegerTy());
+if (isSignedBitIntTy())
+  return *reinterpret_cast(getBitIntBitCountPointer());

jakubjelinek wrote:

If it isn't performance critical, guess using internal_memcpy is an option as 
well.

https://github.com/llvm/llvm-project/pull/96240
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [compiler-rt] [ubsan] Display correct runtime messages for negative _BitInt (PR #96240)

2024-09-02 Thread Jakub Jelínek via cfe-commits

jakubjelinek wrote:

I'm not suggesting to encode the number of limbs anywhere, I'm suggesting 
encoding the bit precision of a limb somewhere.  And the limb ordering.
On little endian of bits in a limb and little endian ordering of limbs in the 
limb array, at least if the limbs are sane (have precision multiple of char 
precision and there are no padding bits in between), the actual limb precision 
might seem to be irrelevant, all you care about is the N from {,{un,}signed 
}_BitInt(N) and whether it is unsigned or signed, so you can treat the passed 
pointer say as an array of 8-bit limbs, N / 8 limbs with full 8 bits and if N % 
8, the last limb containing some further bits (in some ABIs that will be 
required to be sign or zero extended, in other ABIs the padding bits will be 
undefined, but on the libubsan side you can always treat them as undefined and 
always manually extend).
Or treat it as 16-bit limbs, or 32-bit limbs, or 64-bit limbs, or 128-bit 
limbs, for the higher perhaps with doing the limb reads again using 
internal_memcpy so that you don't impose some alignment requirement perhaps the 
target doesn't have.
But on big-endian, I think knowing the limb precision/size is already essential 
(sure, just a theory for now, GCC right only only supports _BitInt on 
little-endian targets because those are the only ones that have specified their 
ABI).
E.g. I believe _BitInt(513) big-endian with big-endian limb ordering would be 
for 32-bit limbs 17 limbs, the first one containing just one bit (the most 
significant of the whole number) and the remaining ones each 32 bits, while for 
64-bit limbs 9 limbs, the first one containing just one bit and the remaining 
ones 64 bits each; and for 128-bit limbs 5 limbs, the first one just one bit, 
the remaining 128 bits each.  You can't decode these without knowing the limb 
size, the data looks different in memory.
And then there is the possibility of big-endian limbs with little-endian 
ordering of the limbs in the array.
As the 15 bits of the current precision used e.g. for normal integers is 
clearly insufficient  to express supported BITINT_MAXWIDTH (8388608 in clang, 
65535 right now in GCC), my suggestion is to use another bit for the limb 
ordering
(say 0 little endian, 1 big endian) and the reaming 14 bits for the limb 
precision (whether log2 or not doesn't matter that much).
As for the actual _BitInt precision after the type name, one option is what you 
currently implemented, i.e. always use 32-bit integer in memory there, plus the 
extra '\0' termination if you really think it is needed, IMHO it is just a 
waste, and another
option is to use say uleb128 encoding of it.

https://github.com/llvm/llvm-project/pull/96240
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [clang] Inject tokens containing #embed back into token stream (PR #97274)

2024-07-09 Thread Jakub Jelínek via cfe-commits

jakubjelinek wrote:

I'd like to point out that most of the testcases I wrote for the GCC 
implementation still fail with the latest clang on godbolt.

https://github.com/llvm/llvm-project/pull/97274
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [clang] Inject tokens containing #embed back into token stream (PR #97274)

2024-07-09 Thread Jakub Jelínek via cfe-commits

jakubjelinek wrote:

Maybe, but I've add quite a few new testcases too...

https://github.com/llvm/llvm-project/pull/97274
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [clang] Inject tokens containing #embed back into token stream (PR #97274)

2024-07-15 Thread Jakub Jelínek via cfe-commits

jakubjelinek wrote:

I meant also e.g. basic tests like gcc/testsuite/c-c++-common/cpp/embed-20.c in
https://gcc.gnu.org/pipermail/gcc-patches/2024-July/657053.html
(and various others too).  Just checked that one and it still fails on latest 
clang trunk using godbolt.

https://github.com/llvm/llvm-project/pull/97274
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [compiler-rt] [ubsan] Display correct runtime messages for negative _BitInt (PR #96240)

2024-08-23 Thread Jakub Jelínek via cfe-commits


@@ -125,6 +137,25 @@ class TypeDescriptor {
 return 1 << (TypeInfo >> 1);
   }
 
+  const char *getBitIntBitCountPointer() const {
+DCHECK(isBitIntTy());
+DCHECK(isSignedBitIntTy());
+// Scan Name for zero and return the next address
+const char *p = getTypeName();
+while (*p != '\0')
+  ++p;
+// Return the next address
+return p + 1;
+  }
+
+  unsigned getIntegerBitCount() const {
+DCHECK(isIntegerTy());
+if (isSignedBitIntTy())
+  return *reinterpret_cast(getBitIntBitCountPointer());

jakubjelinek wrote:

This will not work on strict alignment targets.  I'd suggest instead of trying 
to reinterpret the bits memcpy them into u32 and return that, the compilers 
will optimize that properly.

https://github.com/llvm/llvm-project/pull/96240
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [compiler-rt] [ubsan] Display correct runtime messages for negative _BitInt (PR #96240)

2024-08-23 Thread Jakub Jelínek via cfe-commits


@@ -103,6 +103,13 @@ class TypeDescriptor {
 /// representation is that of bitcasting the floating-point value to an
 /// integer type.
 TK_Float = 0x0001,
+/// An _BitInt(N) type. Lowest bit is 1 for a signed value, 0 for an
+/// unsigned value. Remaining bits are log_2(bit_width). The value

jakubjelinek wrote:

bit_width of what?  The main problem is that the _BitInt ABI is only defined in 
the x86-64 and aarch64 psABIs, for i686 the word was that the ABI should be 
whatever GCC implemented, but nobody updated the ia32 psABI yet.
I think ARM 32-bit is defined too, some other targets are working on it.
As documented in LLVM, it implements _BitInt everywhere but notes that the ABI 
can change (GCC took a different approach, only implementing it for arches 
which agree on an ABI).
So, if compiler-rt implements _BitInt support, it better be able to encode the 
various ABIs which will be used for _BitInt on various arches.
The current ABIs usually have some special cases for smaller number of bits, 
but those generally can be encoded as normal integer (sign or zero extended say 
to long long or __int128).
For the large _BitInts, x86_64 uses little endian ordered array of 64-bit 
limbs, aarch64 little or big endian ordered array of 128-bit limbs, ia32 little 
endian ordered array of 32-bit limbs.
But I was told the powerpc folks are considering little endian ordering of 
limbs on both ppc64le and ppc64 big-endian.
So, I think you certainly want to encode the limb size and the N from 
_BitInt(N) and whether it is signed or unsigned and possibly (e.g. for ppc64) 
also whether the limbs are ordered from least significant to most or vice versa 
(bit ordering within limb would be normally dependent on endianity).
So, I think that bit_width in the bits above sign should be the limb size, not 
_BitInt size rounded up or whatever you're implementing right now.  And encode 
somewhere the endianity too.

https://github.com/llvm/llvm-project/pull/96240
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits