[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

Tom Honermann via Phabricator via cfe-commits Mon, 26 Jul 2021 20:38:06 -0700

tahonermann added inline comments.


================
Comment at: clang/include/clang/AST/Expr.h:1846-1871
+  std::string getStringAsChar() const {
+    assert(getCharByteWidth() == 1 &&
+           "This function is used in places that assume strings use char");
+    return std::string(getTrailingObjects<char>(), getTrailingObjects<char>() 
+ getByteLength());
+  }
+                                  
+  std::u16string getStringAsChar16() const {
----------------
MarcusJohnson91 wrote:
> cor3ntin wrote:
> > aaron.ballman wrote:
> > > One potential issue to this is that in C++, these types are defined to be 
> > > UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, 
> > > Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but 
> > > we may need to add some sanity checks to catch if a target overrides that 
> > > behavior. Currently, all the targets in Clang are keeping that promise, 
> > > but I have no idea what shenanigans downstream users get up to and 
> > > whether their targets remove the macro definition or define it to `0` 
> > > instead of `1`.
> > Is it possible that the host and target wchar_t have a different size here?
> I've honestly been wondering how Clang handled that, in the codebase vs at 
> runtime myself for a while.
WG14 N2728 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2728.htm) proposes 
specifying that `char16_t` and `char32_t` string literals are UTF-16 and UTF-32 
encoded respectively.  SG16 previously tried to identify any implementations 
that use a different encoding for these and was unable to find one.  I think it 
is safe to only support these encodings for `u` and `U` prefixed string 
literals.


================
Comment at: clang/lib/AST/ScanfFormatString.cpp:344-347
+        case LengthModifier::AsUTF16:
+          return ArgType(ArgType::Char16Ty);
+        case LengthModifier::AsUTF32:
+          return ArgType(ArgType::Char32Ty);
----------------
Should these additions not include a `PtrTo` wrapper as is done for other types?


================
Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions &LangOpts) const {
   if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char16;
-  return false;
-}
-
-bool Type::isChar32Type() const {
-  if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char32;
-  return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
-  const auto *BT = dyn_cast<BuiltinType>(CanonicalType);
-  if (!BT) return false;
-  switch (BT->getKind()) {
-  default: return false;
-  case BuiltinType::Char_U:
-  case BuiltinType::UChar:
-  case BuiltinType::WChar_U:
-  case BuiltinType::Char8:
-  case BuiltinType::Char16:
-  case BuiltinType::Char32:
-  case BuiltinType::Char_S:
-  case BuiltinType::SChar:
-  case BuiltinType::WChar_S:
-    return true;
+    if (BT->getKind() == BuiltinType::Char16)
+      return true;
+  if (!LangOpts.CPlusPlus) {
+    return isType("char16_t");
   }
----------------
MarcusJohnson91 wrote:
> aaron.ballman wrote:
> > MarcusJohnson91 wrote:
> > > aaron.ballman wrote:
> > > > If I understand properly, one issue is that `char16_t` is defined by C 
> > > > to be *the same type* as `uint_least16_t`, which itself is a typedef to 
> > > > some other integer type. So we can't make `char16_t` be a distinct type 
> > > > in C, it has to be a typedef to a typedef to an integer type.
> > > > 
> > > > One concern I have about the approach in this patch is that it makes 
> > > > the type system a bit more convoluted. This type is *not* really a 
> > > > character type in C, it's a typedef type that playacts as a character 
> > > > type. I think this is a pretty fundamental change to the type system in 
> > > > the compiler in some ways because this query is no longer about the 
> > > > canonical type but about the semantics of how the type is expected to 
> > > > be used.
> > > > 
> > > > I'd definitely like to hear what @rsmith thinks of this approach.
> > > I see your point, but I'm not sure how else we could get it through the 
> > > type checking system without doing it this way?
> > > 
> > > I tried to be careful to only allow the actual character typedefs through 
> > > by making sure char16_t or char32_t is in the typedef chain, and only in 
> > > C mode.
> > > I see your point, but I'm not sure how else we could get it through the 
> > > type checking system without doing it this way?
> > 
> > I am thinking that rather than doing this work as part of the `Type` 
> > object, the extra work can be done from the format string checker. This 
> > keeps the confusion about types localized to the part of the code that 
> > needs to care about it, rather than pushing it onto all consumers of the 
> > type system.
> Hmmmm, that's a good idea, but I'm not sure how to do that, let me think on 
> it for a bit.
What does the format string checker do for an example like `printf("%Ls", 
L"text")` today?


================
Comment at: clang/lib/Sema/SemaChecking.cpp:9534
         S, inFunctionCall, Args[format_idx],
-        S.PDiag(diag::warn_format_string_is_wide_literal), 
FExpr->getBeginLoc(),
+        S.PDiag(diag::err_format_invalid_type), FExpr->getBeginLoc(),
         /*IsStringLocation*/ true, OrigFormatExpr->getSourceRange());
----------------
Perhaps Clang should warn about use of the new extension in strict conformance 
modes.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

Reply via email to