MarcusJohnson91 added inline comments.
================
Comment at: clang/include/clang/AST/Expr.h:1846-1871
+ std::string getStringAsChar() const {
+ assert(getCharByteWidth() == 1 &&
+ "This function is used in places that assume strings use char");
+ return std::string(getTrailingObjects<char>(), getTrailingObjects<char>()
+ getByteLength());
+ }
+
+ std::u16string getStringAsChar16() const {
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > One potential issue to this is that in C++, these types are defined to be
> > UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently,
> > Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but
> > we may need to add some sanity checks to catch if a target overrides that
> > behavior. Currently, all the targets in Clang are keeping that promise, but
> > I have no idea what shenanigans downstream users get up to and whether
> > their targets remove the macro definition or define it to `0` instead of
> > `1`.
> Is it possible that the host and target wchar_t have a different size here?
I've honestly been wondering how Clang handled that, in the codebase vs at
runtime myself for a while.
================
Comment at: clang/include/clang/AST/FormatString.h:83
AsMAllocate, // for '%ms', GNU extension to scanf
+ AsUTF16, // for '%l16(c|s)', soon to be standardized
+ AsUTF32, // for '%l32(c|s)', soon to be standardized
----------------
aaron.ballman wrote:
> May want to drop the "soon to be standardized" given that the proposal hasn't
> been seen by WG14 yet. I think it's fine to say "Clang extension", though.
> More on the format specifier itself, below.
Done,
and here is the link to the document, I haven't heard any feedback?
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2761.pdf
================
Comment at: clang/include/clang/AST/Type.h:1975
bool isBooleanType() const;
+ bool isType(const std::string TypeName) const;
bool isCharType() const;
----------------
aaron.ballman wrote:
> It's not clear from the header file what this function actually *does*.
> Presumably, everything that can be represented by a `Type` is a type, so I'd
> assume this returns `true` always. Or should this be a static function?
Yeah, could choose a better name just not sure what.
It takes a type name as the argument, and then it desugars the type one step at
a time, and if it finds a match it returns true.
so, let's say we're in C++, and someone typedef'd char8_t to String.
This function will say yes, String is compatible with char8_t for example.
it's mostly for C mode's typedef's of char16_t and char32_t
================
Comment at: clang/lib/AST/Expr.cpp:1197
unsigned *StartTokenByteOffset) const {
- assert((getKind() == StringLiteral::Ascii ||
- getKind() == StringLiteral::UTF8) &&
----------------
aaron.ballman wrote:
> I don't see changes that make this assertion valid to remove -- have I missed
> something?
Nope you didn't miss anything, I did.
this is a remnant from when I was trying to templatize all the format checking
code instead of converting the format strings.
Restored the assert.
================
Comment at: clang/lib/AST/FormatString.cpp:235-240
+ if (LO.C2x && I + 1 != E && I[0] == '1' && I[1] == '6') {
+ ++I;
+ ++I;
+ lmKind = LengthModifier::AsUTF16;
+ break;
+ } else if (LO.C2x && I + 1 != E && I[0] == '3' && I[1] == '2') {
----------------
aaron.ballman wrote:
> I don't think this is a conforming extension to C -- lowercase length
> modifiers and conversion specifiers are reserved for the standard, and
> uppercase length modifiers and conversion specifiers are reserved for the
> implementation. `l16` starts with a lowercase letter, so it's reserved for
> the standard.
>
> Note: WG14 has been considering extensions in a closely-related space
> (integer types rather than string or character types) that you may be
> interested in, if you're not already aware of it:
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2511.pdf. This proposal was
> not adopted, but did have strong sentiment for something along these lines.
Yeah, honestly not really sure what to do about the reservation, L is also used
for wide characters and some integers too, and (U|u)(16|32) wasn't taken well
by the community.
================
Comment at: clang/lib/AST/OSLog.cpp:212
+ } else if (Lit->isUTF16()) {
+ std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> Convert;
+ std::u16string U16 = Lit->getStringAsChar16();
----------------
aaron.ballman wrote:
> MarcusJohnson91 wrote:
> > aaron.ballman wrote:
> > > cor3ntin wrote:
> > > > I'm not sure I have a better suggestion but `codecvt_utf8_utf16` is
> > > > deprecated in C++17
> > > Good point -- this likely should be lifted into a common interface (as it
> > > appears several times in the patch) and make use of the existing LLVM UTF
> > > conversion functionality:
> > > https://github.com/intel/llvm/blob/sycl/llvm/include/llvm/Support/ConvertUTF.h
> > The ConvertUTF version you linked contains `convertUTF32toUTF8String` but
> > the one in LLVM does not, what's the process for updating this?
> Oh, oops, I managed to grab a link from downstream and not LLVM, sorry about
> that! But the downstream and LLVM both have a method to perform the
> conversion
> https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Support/ConvertUTF.h#L162,
> so you could add convenience functions to that file that wraps the API, as
> part of this patch.
I'll probably write a wrapper then
================
Comment at: clang/lib/AST/PrintfFormatString.cpp:254-257
+ if (ParseLengthModifier(FS, I, E, LO) && I == E) { // THIS IS WHERE IT
FREAKS OUT
// No more characters left?
if (Warn)
+ printf("ParsePrintfSpecifier: Incomplete Specifier 8\n");
----------------
aaron.ballman wrote:
> Surprising comment and what looks to be some debugging code?
Removed
================
Comment at: clang/lib/AST/PrintfFormatString.cpp:635
}
- if (LM.getKind() == LengthModifier::AsWide)
- return ArgType(ArgType::WCStrTy, "wchar_t *");
----------------
aaron.ballman wrote:
> It looks like this coverage got lost?
Restored
================
Comment at: clang/lib/AST/PrintfFormatString.cpp:844
case BuiltinType::SChar:
+ case BuiltinType::Char8:
LM.setKind(LengthModifier::AsChar);
----------------
aaron.ballman wrote:
> Is there a reason why you're adding this support for `char16_t` and
> `char32_t` but not `char8_t` aside from "C doesn't have `char8_t`?" If that's
> the sole reason, you might be interested in tracking:
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm
Added char8_t
================
Comment at: clang/lib/AST/Type.cpp:1962
+bool Type::isType(const std::string TypeName) const {
+ QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();
----------------
aaron.ballman wrote:
> Oh, I see now that this is doing a name comparison against the type -- that's
> not a good API in general because it's *really* hard to guess at what the
> type will come out as textually. e.g., `class` and `struct` keywords are
> interchangeable in C++, C sometimes gets confused with `bool` vs `_Bool`,
> template arguments sometimes matter, nested name specifiers, etc. Callers of
> this API will have to guess at these details and the printing of the type may
> change over time (e.g., C may switch from `_Bool` to `bool` and then code
> calling `isType("_Bool")` may react poorly to the change).
>
> I think we need to avoid this sort of API on `Type`.
I see your point, I reverted the behavior back to doing the desugaring in just
isChar16Type and isChar32Type
================
Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions &LangOpts) const {
if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
- return BT->getKind() == BuiltinType::Char16;
- return false;
-}
-
-bool Type::isChar32Type() const {
- if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
- return BT->getKind() == BuiltinType::Char32;
- return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
- const auto *BT = dyn_cast<BuiltinType>(CanonicalType);
- if (!BT) return false;
- switch (BT->getKind()) {
- default: return false;
- case BuiltinType::Char_U:
- case BuiltinType::UChar:
- case BuiltinType::WChar_U:
- case BuiltinType::Char8:
- case BuiltinType::Char16:
- case BuiltinType::Char32:
- case BuiltinType::Char_S:
- case BuiltinType::SChar:
- case BuiltinType::WChar_S:
- return true;
+ if (BT->getKind() == BuiltinType::Char16)
+ return true;
+ if (!LangOpts.CPlusPlus) {
+ return isType("char16_t");
}
----------------
aaron.ballman wrote:
> MarcusJohnson91 wrote:
> > aaron.ballman wrote:
> > > If I understand properly, one issue is that `char16_t` is defined by C to
> > > be *the same type* as `uint_least16_t`, which itself is a typedef to some
> > > other integer type. So we can't make `char16_t` be a distinct type in C,
> > > it has to be a typedef to a typedef to an integer type.
> > >
> > > One concern I have about the approach in this patch is that it makes the
> > > type system a bit more convoluted. This type is *not* really a character
> > > type in C, it's a typedef type that playacts as a character type. I think
> > > this is a pretty fundamental change to the type system in the compiler in
> > > some ways because this query is no longer about the canonical type but
> > > about the semantics of how the type is expected to be used.
> > >
> > > I'd definitely like to hear what @rsmith thinks of this approach.
> > I see your point, but I'm not sure how else we could get it through the
> > type checking system without doing it this way?
> >
> > I tried to be careful to only allow the actual character typedefs through
> > by making sure char16_t or char32_t is in the typedef chain, and only in C
> > mode.
> > I see your point, but I'm not sure how else we could get it through the
> > type checking system without doing it this way?
>
> I am thinking that rather than doing this work as part of the `Type` object,
> the extra work can be done from the format string checker. This keeps the
> confusion about types localized to the part of the code that needs to care
> about it, rather than pushing it onto all consumers of the type system.
Hmmmm, that's a good idea, but I'm not sure how to do that, let me think on it
for a bit.
================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1883-1888
if (SpellingPtr[0] == 'u' && SpellingPtr[1] == '8')
SpellingPtr += 2;
-
- assert(SpellingPtr[0] != 'L' && SpellingPtr[0] != 'u' &&
- SpellingPtr[0] != 'U' && "Doesn't handle wide or utf strings yet");
+
+ // Handle UTF-16 strings
+ if (SpellingPtr[0] == 'u' && SpellingPtr[1] != '8')
+ SpellingPtr += 1;
----------------
aaron.ballman wrote:
>
Did you have anything to say here?
================
Comment at:
clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp:791-808
+
+ StringLiteral::StringKind Kind = FormatString->getKind();
+
+ if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
+ String = FormatString->getStringAsChar();
+ } else if (Kind == StringLiteral::UTF16) {
+ std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>
Convert;
----------------
cor3ntin wrote:
> I think this code is present in multiple places in your PR, could it be
> deduplicated?
Yup, I've been working on that too; This was actually the first thing I started
with.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D103426/new/
https://reviews.llvm.org/D103426
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits