tahonermann added inline comments.
================
Comment at: clang/include/clang/AST/Expr.h:1846-1871
+ std::string getStringAsChar() const {
+ assert(getCharByteWidth() == 1 &&
+ "This function is used in places that assume strings use char");
+ return std::string(getTrailingObjects<char>(), getTrailingObjects<char>()
+ getByteLength());
+ }
+
+ std::u16string getStringAsChar16() const {
----------------
MarcusJohnson91 wrote:
> cor3ntin wrote:
> > aaron.ballman wrote:
> > > One potential issue to this is that in C++, these types are defined to be
> > > UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently,
> > > Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but
> > > we may need to add some sanity checks to catch if a target overrides that
> > > behavior. Currently, all the targets in Clang are keeping that promise,
> > > but I have no idea what shenanigans downstream users get up to and
> > > whether their targets remove the macro definition or define it to `0`
> > > instead of `1`.
> > Is it possible that the host and target wchar_t have a different size here?
> I've honestly been wondering how Clang handled that, in the codebase vs at
> runtime myself for a while.
WG14 N2728 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2728.htm) proposes
specifying that `char16_t` and `char32_t` string literals are UTF-16 and UTF-32
encoded respectively. SG16 previously tried to identify any implementations
that use a different encoding for these and was unable to find one. I think it
is safe to only support these encodings for `u` and `U` prefixed string
literals.
================
Comment at: clang/lib/AST/ScanfFormatString.cpp:344-347
+ case LengthModifier::AsUTF16:
+ return ArgType(ArgType::Char16Ty);
+ case LengthModifier::AsUTF32:
+ return ArgType(ArgType::Char32Ty);
----------------
Should these additions not include a `PtrTo` wrapper as is done for other types?
================
Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions &LangOpts) const {
if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
- return BT->getKind() == BuiltinType::Char16;
- return false;
-}
-
-bool Type::isChar32Type() const {
- if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
- return BT->getKind() == BuiltinType::Char32;
- return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
- const auto *BT = dyn_cast<BuiltinType>(CanonicalType);
- if (!BT) return false;
- switch (BT->getKind()) {
- default: return false;
- case BuiltinType::Char_U:
- case BuiltinType::UChar:
- case BuiltinType::WChar_U:
- case BuiltinType::Char8:
- case BuiltinType::Char16:
- case BuiltinType::Char32:
- case BuiltinType::Char_S:
- case BuiltinType::SChar:
- case BuiltinType::WChar_S:
- return true;
+ if (BT->getKind() == BuiltinType::Char16)
+ return true;
+ if (!LangOpts.CPlusPlus) {
+ return isType("char16_t");
}
----------------
MarcusJohnson91 wrote:
> aaron.ballman wrote:
> > MarcusJohnson91 wrote:
> > > aaron.ballman wrote:
> > > > If I understand properly, one issue is that `char16_t` is defined by C
> > > > to be *the same type* as `uint_least16_t`, which itself is a typedef to
> > > > some other integer type. So we can't make `char16_t` be a distinct type
> > > > in C, it has to be a typedef to a typedef to an integer type.
> > > >
> > > > One concern I have about the approach in this patch is that it makes
> > > > the type system a bit more convoluted. This type is *not* really a
> > > > character type in C, it's a typedef type that playacts as a character
> > > > type. I think this is a pretty fundamental change to the type system in
> > > > the compiler in some ways because this query is no longer about the
> > > > canonical type but about the semantics of how the type is expected to
> > > > be used.
> > > >
> > > > I'd definitely like to hear what @rsmith thinks of this approach.
> > > I see your point, but I'm not sure how else we could get it through the
> > > type checking system without doing it this way?
> > >
> > > I tried to be careful to only allow the actual character typedefs through
> > > by making sure char16_t or char32_t is in the typedef chain, and only in
> > > C mode.
> > > I see your point, but I'm not sure how else we could get it through the
> > > type checking system without doing it this way?
> >
> > I am thinking that rather than doing this work as part of the `Type`
> > object, the extra work can be done from the format string checker. This
> > keeps the confusion about types localized to the part of the code that
> > needs to care about it, rather than pushing it onto all consumers of the
> > type system.
> Hmmmm, that's a good idea, but I'm not sure how to do that, let me think on
> it for a bit.
What does the format string checker do for an example like `printf("%Ls",
L"text")` today?
================
Comment at: clang/lib/Sema/SemaChecking.cpp:9534
S, inFunctionCall, Args[format_idx],
- S.PDiag(diag::warn_format_string_is_wide_literal),
FExpr->getBeginLoc(),
+ S.PDiag(diag::err_format_invalid_type), FExpr->getBeginLoc(),
/*IsStringLocation*/ true, OrigFormatExpr->getSourceRange());
----------------
Perhaps Clang should warn about use of the new extension in strict conformance
modes.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D103426/new/
https://reviews.llvm.org/D103426
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits