[clang] [clang-tools-extra] Reland [clang][Sema, Lex, Parse] Preprocessor embed in C and C++ (PR #95802)
jakubjelinek wrote: @ThePhD @AaronBallman @cor3ntin Joseph Myers raised an interesting question whether the tokens in embed-parameter-sequence are macro expanded or not. Consider ```c #define FILE "/etc/passwd" #define LIMIT limit(1) #define THIS , 1, 2, 3 #define PRE prefix (42, ONE #embed FILE LIMIT suffix(THIS) PRE ) TWO #embed "/etc/passwd" LIMIT suffix(THIS) PRE ) THREE #define limit prefix #embed "/etc/passwd" limit (4) suffix (THIS) ``` The first #embed is I hope clear in that it is the #embed pp-tokens new-line case where everything gets macro expanded. The second case is less clear, the filename part matches the " q-char-sequence " case, but LIMIT suffix(THIS) PRE ) is not valid embed-parameter-sequence, so shouldn't it be expanded too? And the last case by strict reading shouldn't be macro expanded because it is valid embed-parameter-sequence, still both the the.phd branch and clang trunk and also my GCC patchset handle it as prefix (4) suffix(, 1, 2, 3). Would in that reading #embed "/etc/passwd" LIMIT suffix(THIS) be valid too and thus not macro expanded and thus later invalid? And, if embed-parameter-sequence tokens are macro expanded only sometimes, what should happen say with #define ARG 2) if_empty (1 #embed "file" limit (ARG) ? On ```c #define A "/etc/passwd" limit (1) ) + (0 #if __has_embed (A) int i; #endif ``` also all 3 compilers agree and happily use the closing ) from the macro to use the ) from the macro expansion for the closing ) of __has_embed and continue through to the rest of the expression. Or is __has_embed supposed to be always macro expanded? https://github.com/llvm/llvm-project/pull/95802 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [clang-tools-extra] Reland [clang][Sema, Lex, Parse] Preprocessor embed in C and C++ (PR #95802)
@@ -0,0 +1,98 @@ +// RUN: %clang_cc1 %s -fsyntax-only --embed-dir=%S/Inputs -verify=expected,cxx -Wno-c23-extensions +// RUN: %clang_cc1 -x c -std=c23 %s -fsyntax-only --embed-dir=%S/Inputs -verify=expected,c +#embed +; + +void f (unsigned char x) { (void)x;} +void g () {} +void h (unsigned char x, int y) {(void)x; (void)y;} +int i () { + return +#embed + ; +} + +_Static_assert( +#embed suffix(,) +"" +); +_Static_assert( +#embed +, "" +); +_Static_assert(sizeof( +#embed +) == +sizeof(unsigned char) +, "" +); +_Static_assert(sizeof +#embed +, "" +); +_Static_assert(sizeof( +#embed // expected-warning {{left operand of comma operator has no effect}} +) == +sizeof(unsigned char) +, "" +); + +#ifdef __cplusplus +template +void j() { + static_assert(First == 'j', ""); + static_assert(Second == 'k', ""); +} +#endif + +void do_stuff() { + f( +#embed + ); + g( +#embed + ); + h( +#embed + ); + int r = i(); + (void)r; +#ifdef __cplusplus + j< +#embed + >( +#embed + ); +#endif +} + +// Ensure that we don't accidentally allow you to initialize an unsigned char * +// from embedded data; the data is modeled as a string literal internally, but +// is not actually a string literal. +const unsigned char *ptr = +#embed // expected-warning {{left operand of comma operator has no effect}} +; // c-error@-2 {{incompatible integer to pointer conversion initializing 'const unsigned char *' with an expression of type 'unsigned char'}} \ + cxx-error@-2 {{cannot initialize a variable of type 'const unsigned char *' with an rvalue of type 'unsigned char'}} + +// However, there are some cases where this is fine and should work. +const unsigned char *null_ptr_1 = +#embed if_empty(0) +; + +const unsigned char *null_ptr_2 = +#embed +; + +const unsigned char *null_ptr_3 = { +#embed +}; + +#define FILE_NAME +#define LIMIT 1 +#define OFFSET 0 +#define EMPTY_SUFFIX suffix() + +constexpr unsigned char ch = +#embed FILE_NAME limit(LIMIT) clang::offset(OFFSET) EMPTY_SUFFIX +; +static_assert(ch == 0); jakubjelinek wrote: Unless you want to special case it in way too many spots, I'd think it would be far easier to optimize just the inner part of the integer sequence, i.e. everything except the first and last sequence element (maybe with the exception when the last prefix token is , or first suffix token is , Because one can use arbitrary tokens before and after the #embed, it can be ```c const unsigned char a[] = { -400 + 4 * #embed __FILE__ - 27 }; ``` (or with tokens from prefix/suffix) and at least the current patchset mishandles many of such cases. For the inner part of the sequence you know there is , before it and , after it, which simplifies a lot of things. The above is handled correctly by GCC and by clang -save-temps, but not by clang without -save-temps. And there are tons of other cases like that, e.g. even designated initializer [26] = before the sequence, etc. https://github.com/llvm/llvm-project/pull/95802 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [clang] Inject tokens containing #embed back into token stream (PR #97274)
jakubjelinek wrote: @ThePhd @AaronBallman And even more importantly (checking on godbolt again): ```c int a = sizeof ( #embed __FILE__ limit (1) ); ``` is 1 in clang as well as clang++ trunk and 4 with clang/clang++ trunk with -save-temps. I thought there was agreement that at least for C the literals have type int, for C++ it is fuzzy and depends on what will be voted in but in any case, there shouldn't be different code produced between integrated preprocessing and -save-temps. If C++ requires some cast, it would need to make it clear what that exact cast is, i.e. whether it is supposed to be ```c 12,143,12,16 ``` or ```c static_cast(12),static_cast(143),static_cast(12),static_cast(16) ``` or ```c (unsigned char)12,(unsigned char)143,(unsigned char)12,(unsigned char)16 ``` or ```c '\014','\0217','\014','\020' ``` or whatever else and the preprocessor would need to emit that cast on every literal (unless using some extension like #embed "." __gnu__::__base64__("...") that the GCC patchset uses for the inner parts of the longer sequences. I think the exact type of cast can affect parsing of some of the expressions, e.g. ```c extern long long p[]; auto foo () { return #embed __FILE__ limit (1) [p]; } ``` will behave one way with the cast is (unsigned char)12 and differently if it is static_cast(12) or ((unsigned char)12). Most of the other bugs I'm seeing are also about consistency, with -save-temps it works IMHO correctly while without it misbehaves. The behavior has to be the same. https://github.com/llvm/llvm-project/pull/97274 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [compiler-rt] [ubsan] Display correct runtime messages for negative _BitInt (PR #96240)
@@ -125,6 +137,25 @@ class TypeDescriptor { return 1 << (TypeInfo >> 1); } + const char *getBitIntBitCountPointer() const { +DCHECK(isBitIntTy()); +DCHECK(isSignedBitIntTy()); +// Scan Name for zero and return the next address +const char *p = getTypeName(); +while (*p != '\0') + ++p; +// Return the next address +return p + 1; + } + + unsigned getIntegerBitCount() const { +DCHECK(isIntegerTy()); +if (isSignedBitIntTy()) + return *reinterpret_cast(getBitIntBitCountPointer()); jakubjelinek wrote: If it isn't performance critical, guess using internal_memcpy is an option as well. https://github.com/llvm/llvm-project/pull/96240 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [compiler-rt] [ubsan] Display correct runtime messages for negative _BitInt (PR #96240)
jakubjelinek wrote: I'm not suggesting to encode the number of limbs anywhere, I'm suggesting encoding the bit precision of a limb somewhere. And the limb ordering. On little endian of bits in a limb and little endian ordering of limbs in the limb array, at least if the limbs are sane (have precision multiple of char precision and there are no padding bits in between), the actual limb precision might seem to be irrelevant, all you care about is the N from {,{un,}signed }_BitInt(N) and whether it is unsigned or signed, so you can treat the passed pointer say as an array of 8-bit limbs, N / 8 limbs with full 8 bits and if N % 8, the last limb containing some further bits (in some ABIs that will be required to be sign or zero extended, in other ABIs the padding bits will be undefined, but on the libubsan side you can always treat them as undefined and always manually extend). Or treat it as 16-bit limbs, or 32-bit limbs, or 64-bit limbs, or 128-bit limbs, for the higher perhaps with doing the limb reads again using internal_memcpy so that you don't impose some alignment requirement perhaps the target doesn't have. But on big-endian, I think knowing the limb precision/size is already essential (sure, just a theory for now, GCC right only only supports _BitInt on little-endian targets because those are the only ones that have specified their ABI). E.g. I believe _BitInt(513) big-endian with big-endian limb ordering would be for 32-bit limbs 17 limbs, the first one containing just one bit (the most significant of the whole number) and the remaining ones each 32 bits, while for 64-bit limbs 9 limbs, the first one containing just one bit and the remaining ones 64 bits each; and for 128-bit limbs 5 limbs, the first one just one bit, the remaining 128 bits each. You can't decode these without knowing the limb size, the data looks different in memory. And then there is the possibility of big-endian limbs with little-endian ordering of the limbs in the array. As the 15 bits of the current precision used e.g. for normal integers is clearly insufficient to express supported BITINT_MAXWIDTH (8388608 in clang, 65535 right now in GCC), my suggestion is to use another bit for the limb ordering (say 0 little endian, 1 big endian) and the reaming 14 bits for the limb precision (whether log2 or not doesn't matter that much). As for the actual _BitInt precision after the type name, one option is what you currently implemented, i.e. always use 32-bit integer in memory there, plus the extra '\0' termination if you really think it is needed, IMHO it is just a waste, and another option is to use say uleb128 encoding of it. https://github.com/llvm/llvm-project/pull/96240 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [clang] Inject tokens containing #embed back into token stream (PR #97274)
jakubjelinek wrote: I'd like to point out that most of the testcases I wrote for the GCC implementation still fail with the latest clang on godbolt. https://github.com/llvm/llvm-project/pull/97274 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [clang] Inject tokens containing #embed back into token stream (PR #97274)
jakubjelinek wrote: Maybe, but I've add quite a few new testcases too... https://github.com/llvm/llvm-project/pull/97274 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [clang] Inject tokens containing #embed back into token stream (PR #97274)
jakubjelinek wrote: I meant also e.g. basic tests like gcc/testsuite/c-c++-common/cpp/embed-20.c in https://gcc.gnu.org/pipermail/gcc-patches/2024-July/657053.html (and various others too). Just checked that one and it still fails on latest clang trunk using godbolt. https://github.com/llvm/llvm-project/pull/97274 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [compiler-rt] [ubsan] Display correct runtime messages for negative _BitInt (PR #96240)
@@ -125,6 +137,25 @@ class TypeDescriptor { return 1 << (TypeInfo >> 1); } + const char *getBitIntBitCountPointer() const { +DCHECK(isBitIntTy()); +DCHECK(isSignedBitIntTy()); +// Scan Name for zero and return the next address +const char *p = getTypeName(); +while (*p != '\0') + ++p; +// Return the next address +return p + 1; + } + + unsigned getIntegerBitCount() const { +DCHECK(isIntegerTy()); +if (isSignedBitIntTy()) + return *reinterpret_cast(getBitIntBitCountPointer()); jakubjelinek wrote: This will not work on strict alignment targets. I'd suggest instead of trying to reinterpret the bits memcpy them into u32 and return that, the compilers will optimize that properly. https://github.com/llvm/llvm-project/pull/96240 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [compiler-rt] [ubsan] Display correct runtime messages for negative _BitInt (PR #96240)
@@ -103,6 +103,13 @@ class TypeDescriptor { /// representation is that of bitcasting the floating-point value to an /// integer type. TK_Float = 0x0001, +/// An _BitInt(N) type. Lowest bit is 1 for a signed value, 0 for an +/// unsigned value. Remaining bits are log_2(bit_width). The value jakubjelinek wrote: bit_width of what? The main problem is that the _BitInt ABI is only defined in the x86-64 and aarch64 psABIs, for i686 the word was that the ABI should be whatever GCC implemented, but nobody updated the ia32 psABI yet. I think ARM 32-bit is defined too, some other targets are working on it. As documented in LLVM, it implements _BitInt everywhere but notes that the ABI can change (GCC took a different approach, only implementing it for arches which agree on an ABI). So, if compiler-rt implements _BitInt support, it better be able to encode the various ABIs which will be used for _BitInt on various arches. The current ABIs usually have some special cases for smaller number of bits, but those generally can be encoded as normal integer (sign or zero extended say to long long or __int128). For the large _BitInts, x86_64 uses little endian ordered array of 64-bit limbs, aarch64 little or big endian ordered array of 128-bit limbs, ia32 little endian ordered array of 32-bit limbs. But I was told the powerpc folks are considering little endian ordering of limbs on both ppc64le and ppc64 big-endian. So, I think you certainly want to encode the limb size and the N from _BitInt(N) and whether it is signed or unsigned and possibly (e.g. for ppc64) also whether the limbs are ordered from least significant to most or vice versa (bit ordering within limb would be normally dependent on endianity). So, I think that bit_width in the bits above sign should be the limb size, not _BitInt size rounded up or whatever you're implementing right now. And encode somewhere the endianity too. https://github.com/llvm/llvm-project/pull/96240 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits