Re: [RFC] [C]New syntax for the argument of counted_by attribute for C language

Yeoul Na Fri, 28 Mar 2025 11:16:35 -0700


> On Mar 28, 2025, at 9:05 AM, Qing Zhao <[email protected]> wrote:
> 
> 
> 
>> On Mar 28, 2025, at 08:51, Yeoul Na <[email protected]> wrote:
>> 
>> 
>> 
>>> On Mar 27, 2025, at 9:17 AM, Qing Zhao <[email protected]> wrote:
>>> 
>>> Yeoul,
>>> 
>>> Thanks for the writeup.
>>> 
>>> So, basically, this writeup insists on introducing a new “structure scope” 
>>> (similar to the instance scope in C++) into the C language ONLY for the 
>>> counted_by attribute:
>>> 
>>> 1. Inside counted_by attribute, the name lookup starts:
>>> 
>>>   A. Inside the current structure first (the NEW structure scope added to 
>>> C);
>>>   B. Then outside the structure; (other current C scopes, local scope or 
>>> global scope)
>>> 
>>> 2. When trying to reference a variable outside of the structure scope whose 
>>> name conflicts with a structure member, a new builtin function 
>>> “__builtin_global_ref” is introduced for that purpose.
>>> 
>>>  (I think that __builtin_global_ref might not be accurate, because the outer 
>>> scope might be either global scope or local scope)
>> 
>> 
>> Clarification: __builtin_global_ref will see the global scope directly. This 
>> is similar to global scope resolution syntax (‘::’) in C++.
> 
> Yes, that’s my thought too. 
> 
> Then, you still need another builtin to refer to the local variable with the 
> same name as the structure member, for example, 
> In the below example, if the “len” inside the counted_by refers to the “const 
> int len = 20”, how do you specify this?
>> 
>> constexpr int len = 10;
>> 
>> void foo (void)
>> {
>>  const int len = 20;
>> 
>>  struct s {
>>    int len;
>>    int *__counted_by(__builtin_global_ref(len)) buf; // refers to global ‘len’
>>  };
>> }
>> 
>> Here are some reasons why we chose to provide a global scope resolution 
>> builtin, not a builtin to see an outer scope or just a local scope:
>> 
>> 1) The builtin is a substitute for some “scope resolution specifier”. Scope 
>> specifiers are typically meant to choose a “specific” scope.
>> 2) To the best of my knowledge there is no precedent in any other C family 
>> language for providing scope resolution for local scopes.
> 
> However, there is possibility that in the above example, the “len” might 
> refer to the local variable len, not the global one. How do you specify that?
> 
>> 3) Name conflicts with local variables can be easily renamed.
> 
> Then more source code change in different places is needed, I am not sure 
> whether this is easy to do in some cases.


The change would be only to the local variable and, IMHO, renaming a local 
variable is unlikely to be a problem in practice.

In fact, when a local variable shadows a global variable like below, C 
currently has no scope resolution syntax for choosing the global variable 
‘var’. The programmer renames their local variable in a case like this.

int var;

void foo(void) {
  int var;
  // ... 
  var = 0; // actually wanted to use a global var
}
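
A minimal sketch of that renaming workaround, for illustration only. The 
function names `shadowed` and `renamed` and the name `local_var` are made up 
here and do not come from the proposal:

```c
int var = 1; /* file-scope variable */

/* The local 'var' shadows the global, so the assignment hits the local. */
void shadowed(void) {
  int var = 5;
  var = 0; /* writes the local; the global keeps its value */
  (void)var;
}

/* After renaming the local, the plain name 'var' reaches the global again. */
void renamed(void) {
  int local_var = 5; /* hypothetical new name for the former local 'var' */
  (void)local_var;
  var = 0; /* writes the global */
}
```

Calling shadowed() leaves the global untouched, while renamed() sets it to 0; 
the only source change is the name of the local.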


That said, I do not object to introducing some builtin to specify local scope 
at some point, if a compelling use case is found during adoption.


> 
>> 4) If we provide a builtin that selects outer scope instead, there is no way 
>> to choose a global ‘len' if it’s shadowed by a local variable, so then the 
>> member name has to be renamed anyway in order to choose a global `len`. 
> 
> Yes, that’s true. So maybe two builtins are needed?
> 
>> 5) This way, code can be written compatibly both in C and C++.
>> 
>>> 
>>> 3. When there is a conflict between counted_by and VLA such as:
>>> 
>>> constexpr int len = 10;
>>> 
>>> struct s {
>>>   int len;
>>>   int *__counted_by(len) buf; // refers to struct member `len`.
>>>   int arr[len]; // refers to global constexpr `len`
>>> };
>>> 
>>> Issue a compiler warning asking the user to use __builtin_global_ref 
>>> to disambiguate. 
>> 
>> Additionally, our proposal suggests __builtin_member_ref to explicitly use a 
>> member in a similar situation.
>> The builtin could be replaced by ‘__self' or some other syntax once the 
>> standard committee decides in the future, but earlier in the thread JeanHeyd 
>> pointed out that:
>> 
>> "I would like to gently push back about __self__, or __self, or self, 
>> because all of these identifiers are fairly common identifiers in code. When 
>> I was writing the paper for __self_func ( 
>> https://thephd.dev/_vendor/future_cxx/papers/C%20-%20__self_func.html ), I 
>> searched GitHub and other source code indexing and repository services: 
>> __self, __self__, and self have a substantial number of uses. If there's an 
>> alternative spelling to consider, I think that would be helpful."
>> 
>> Thus, I think instead of trying to stick to a certain syntax right now, 
>> using some builtin will allow us to easily migrate to a new syntax by 
>> guarding the current usage under a macro.
>> 
>> Writing the builtin could be cumbersome but this shall be written only when 
>> there is an ambiguity. Btw, I’m open to any other name suggestions for the 
>> builtins!
> 
> I think that it’s better to stick to one approach:

Just to clarify: by “to stick to a certain syntax”, I meant that there seems to 
be no agreement on which syntax to use (“__self.”, “.”, “this->”, etc.) for the 
member scope resolution syntax. JeanHeyd pointed out earlier that the spelling 
“__self” already has a substantial number of uses in C code. And dot has other 
problems too. It seems it’s going to take a while to choose the right one. And 
we might have to change yet again when the C standard committee decides on some 
other syntax.  

Therefore, I think using builtins instead will help, because a builtin can 
easily be wrapped in a macro.
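
A rough sketch of guarding the usage under a macro, under stated assumptions. 
`COUNTED_BY`, `MEMBER_REF`, `struct packet` and `packet_new` are made-up names 
for illustration. `__builtin_member_ref` is only proposed, so the sketch 
assumes no compiler provides it yet and falls back to the bare member name. It 
uses a flexible array member, since that is where counted_by is already 
available in released compilers:

```c
#include <stddef.h>
#include <stdlib.h>

/* Fallbacks for compilers that lack these feature-test operators. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Expands to the attribute when supported, otherwise to nothing. */
#if __has_attribute(counted_by)
#define COUNTED_BY(expr) __attribute__((counted_by(expr)))
#else
#define COUNTED_BY(expr)
#endif

/* Hypothetical: the proposed builtin is used only if it ever exists.
   If another spelling is standardized later, only this macro changes. */
#if __has_builtin(__builtin_member_ref)
#define MEMBER_REF(name) __builtin_member_ref(name)
#else
#define MEMBER_REF(name) name
#endif

struct packet {
  size_t len;
  int data[] COUNTED_BY(MEMBER_REF(len)); /* count is the member 'len' */
};

/* Allocates a zeroed packet with room for n elements. */
struct packet *packet_new(size_t n) {
  struct packet *p = calloc(1, sizeof(*p) + n * sizeof(int));
  if (p != NULL)
    p->len = n; /* set the count before touching data[] */
  return p;
}
```

User code only ever spells MEMBER_REF(len), so migrating to a future syntax 
would mean changing one macro definition.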

> 
> A. Add a new keyword “__self” or __builtin_self() to explicitly refer to the 
> member variable, and keep everything else unchanged. 
> 
> OR:
> 
> B. Add one new instance scope into C, and look up the name inside the new scope 
> first, then the outer scope. To refer to variables outside the instance 
> scope, use a newly added “scope resolution specifier”, such as 
> __builtin_global_… __builtin_local_… for that purpose.
>     For B, fix the VLA inside a structure to have the same lookup rule as 
> counted_by. 
> 
> 
> Anything mixing these two is not good to me...

Ok, I would be fine with the second option.

>> 
>>> 
>>> Are the above the correct understanding of your writeup?
>> 
>> Yes, it’s mostly correct, except some clarifications I made above. Thank you!
> 
> Thank you for the clarifications.
> 
> Qing
>> 
>>> 
>>> 
>>> From my understanding:
>>> 
>>> 1. This design started from the C++’s point of view by adding a new 
>>> “structure scope” to C;
>>> 2. This design conflicts with the current VLA default scope rule (which is 
>>> based on the default C scopes) in C.
>>>    In the above example that mixes counted_by and VLA, it’s so weird that 
>>> there are two different name lookup rules inside the same structure. 
>>>    It’s clearly a design bug. Either VLA or counted_by needs to be fixed to 
>>> make them consistent. 
>>> 
>>> 
>>> I personally do not completely object to introducing a new “structure scope” 
>>> into C, but it’s so hard for me to accept that there are two different name 
>>> lookup rules inside the same structure: one rule for VLA, another rule for 
>>> the counted_by attribute. (If introducing a new “structure scope” to C, I 
>>> think it’s better to change VLA to “structure scope” too; I am not sure 
>>> whether this is feasible or not.)
>>> 
>>> I still think that introducing a new keyword “__self” for referring to a 
>>> member variable inside a structure, without adding a new “structure scope”, 
>>> would be the best approach to resolve this issue in C. 
>>> 
>>> However, I am really hoping that the discussion can converge soon. So, I am 
>>> okay with adding a new “structure scope” if most people agree on that 
>>> approach. 
>> 
>> Thanks for the flexibility!
>> 
>>> 
>>> Qing
>>> 
>>> 
>>>> On Mar 26, 2025, at 12:59, Yeoul Na <[email protected]> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> Thanks for all the discussions.
>>>> 
>>>> I posted the design rationale for our current approach in 
>>>> https://discourse.llvm.org/t/rfc-forward-referencing-a-struct-member-within-bounds-annotations/85510.
>>>>  This clarifies some of the questions that were asked in this thread. The 
>>>> document also proposes diagnostics to mitigate potential ambiguity, and 
>>>> proposes new builtins that can be used as a suppression and disambiguation 
>>>> mechanism.
>>>> 
>>>> Best regards,
>>>> Yeoul
>>>> 
>>>>> On Mar 26, 2025, at 9:11 AM, Yeoul Na <[email protected]> wrote:
>>>>> 
>>>>> Sorry for the delay.
>>>>> 
>>>>> I’m planning on sending out our design rationale of the current approach 
>>>>> without the new syntax today.
>>>>> 
>>>>> - Yeoul
>>>>> 
>>>>>> On Mar 14, 2025, at 9:22 PM, John McCall <[email protected]> wrote:
>>>>>> 
>>>>>> On 14 Mar 2025, at 15:18, Martin Uecker wrote:
>>>>>> On Friday, 14.03.2025 at 14:42 -0400, John McCall wrote:
>>>>>> On 14 Mar 2025, at 14:13, Martin Uecker wrote:
>>>>>> On Friday, 14.03.2025 at 10:11 -0700, David Tarditi wrote:
>>>>>> Hi Martin,
>>>>>> The C design of VLAs misunderstood dependent typing.
>>>>>> They probably did not care about theory, but the design is 
>>>>>> not inconsistent with theory.
>>>>>> This is almost true, but for bad reasons. The theory of dependent types 
>>>>>> is heavily concerned with deciding whether two types are the same, and C 
>>>>>> simply sidesteps this question because type identity is largely 
>>>>>> meaningless in C. Every value of variably-modified type is (or decays 
>>>>>> to) a pointer, and all pointers in C freely convert to one another 
>>>>>> (within the object/function categories). _Generic is based on type 
>>>>>> compatibility, not equality. So in that sense, the standard doesn’t say 
>>>>>> anything inconsistent with theory because it doesn’t even try to say 
>>>>>> anything.
>>>>>> The reason it is not quite true is that C does have rules for compatible 
>>>>>> and composite types, and alas, those rules for variably-modified types 
>>>>>> are not consistent with theory. Two VLA types of compatible element type 
>>>>>> are always statically considered compatible, and it’s simply UB if the 
>>>>>> sizes aren’t the same. The composite type of a VLA and a fixed-size 
>>>>>> array type is always the fixed-size array type. The standard is 
>>>>>> literally incomplete about the composite type of two VLAs; if you use a 
>>>>>> ternary operator where both operands are casts to VLA types, the 
>>>>>> standard just says it’s straight-up just undefined behavior (because one 
>>>>>> of the types has a bound that’s unevaluated) and doesn’t even bother 
>>>>>> telling us what the static type is supposed to be.
>>>>>> Yes, I guess this is all true.
>>>>>> But let's rephrase my point a bit more precisely: One could take 
>>>>>> a strict subset of C that includes variably modified types but 
>>>>>> obviously has to forbid a lot of other things (e.g. arbitrary pointer 
>>>>>> conversions or unsafe down-casts and much more) and make this a 
>>>>>> memory-safe language with dependent types. This would also 
>>>>>> require adding run-time checks at certain places where there 
>>>>>> is now UB, in particular where two VLA types need to be compatible.
>>>>>> Mmm. You can certainly subset C to the point that it’s memory-safe, but
>>>>>> it wouldn’t really be anything like C anymore. As long as C has a heap,
>>>>>> I don’t see any path to achieving temporal safety without significant
>>>>>> extensions to the language. But if we’re just talking about spatial
>>>>>> safety, then sure, that could be a lot closer to C today.
>>>>>> Is that your vision, then, that you’d like to see the same sort of checks
>>>>>> that -fbounds-safety does, but you want them based firmly in the language
>>>>>> as a dynamic check triggered by pointer type conversion, with bounds
>>>>>> specified using variably-modified types? It’s a pretty elegant vision,
>>>>>> and I can see the attraction. It has some real merits, which I’ll get to
>>>>>> below.
>>>>>> I do see at least two significant challenges, though.
>>>>>> The first and biggest problem is that, in general, array bounds can only
>>>>>> be expressed on a pointer value if it’s got pointer to array type. Most C
>>>>>> array code today works primarily with pointers to elements; programmers
>>>>>> just use array types to create concrete arrays, and they very rarely use
>>>>>> pointers to array type at all. There are a bunch of reasons for that:
>>>>>>   • Pointers to arrays have to be dereferenced twice: (*ptr)[idx] instead
>>>>>>     of ptr[idx].
>>>>>>   • That makes them more error-prone, because it is easy to do pointer
>>>>>>     arithmetic at the wrong level, e.g. by writing ptr[idx], which will
>>>>>>     stride by multiples of the entire array size. That may even pass the
>>>>>>     compiler without complaint because of C’s laxness about conversions.
>>>>>>   • Keeping the bound around in the pointer type is more work and doesn’t
>>>>>>     do anything useful right now.
>>>>>>   • A lot of C programmers dislike nested declarator syntax and can’t
>>>>>>     remember how it works. Those of us who can write it off the top of our
>>>>>>     heads are quite atypical.
>>>>>> Now, there is an exception: you can write a parameter using an array
>>>>>> type, and it actually declares a pointer parameter. You could imagine
>>>>>> using this as a syntax for an enforceable array bound for arguments,
>>>>>> although the committee did already decide that these bounds were
>>>>>> meaningless without static. Unfortunately, you can’t do this in any other
>>>>>> position and still end up with just a pointer, so it’s not helpful as a
>>>>>> general syntax for associating bounds with pointers.
>>>>>> The upshot is that this isn’t really something people can just adopt by
>>>>>> adding annotations. It’s not just a significant rewrite, it’s a rewrite
>>>>>> that programmers will have very legitimate objections to. I think that
>>>>>> makes this at best a complement to the “sidecar” approach taken by
>>>>>> -fbounds-safety where we can track top-level bounds to a specific pointer
>>>>>> value.
>>>>>> The second problem is that there are some extralingual problems that
>>>>>> -fbounds-safety has to solve around bounds that aren’t just local
>>>>>> evaluations of bounds expressions, and a type-conversion-driven approach
>>>>>> doesn’t help with any of them.
>>>>>> As you mentioned, the design of variably-modified types is based on
>>>>>> evaluating the bounds expression at some specific point in the program
>>>>>> execution. Since these types can only be written locally, the evaluation
>>>>>> point is obvious. If we wanted to dynamically enforce bounds during
>>>>>> initialization, it would simply be another use of the same computed
>>>>>> bound:
>>>>>> int count = ...;
>>>>>> int (*ptr)[count * 10] = source_ptr;
>>>>>> 
>>>>>> Here we would evaluate count * 10 exactly once and use it both as (1)
>>>>>> part of the destination type when initializing ptr with source_ptr and
>>>>>> (2) part of the type of ptr for all uses of it. For example, if source_ptr
>>>>>> were of type int (*)[100], we would dynamically check that
>>>>>> count * 10 <= 100. This all works perfectly with an arbitrary bounds
>>>>>> expression; it could even contain an opaque function call.
>>>>>> Note that we don’t need any special behavior specifically for
>>>>>> initialization. If we later assign a new value into ptr, we will still be
>>>>>> converting the new value to the type int (*)[< count * 10 >], using the
>>>>>> value computed at the time of declaration of the variable. This model
>>>>>> would simply require that conversion to validate the bounds during
>>>>>> assignment just as it would during initialization.
>>>>>> Now, with nested arrays, variance does become a problem. Let’s reduce
>>>>>> bounds expressions to their evaluated bounds to make this easier to write.
>>>>>>   • int (*)[11] can be converted to int(*)[10] because we’re simply
>>>>>> allowing fewer elements to be used.
>>>>>>   • By the same token, int (*(*)[11])[5] can be converted to
>>>>>> int (*(*)[10])[5]. This is the same logic as the above, just with an
>>>>>> element type that happens to be a pointer to array type.
>>>>>>   • But int (*(*)[11])[5] cannot be safely converted to
>>>>>> int (*(*)[11])[4], because while it’s safe to read an int (*)[4] from
>>>>>> this array, it’s not safe to assign one into it.
>>>>>>   • int (* const (*)[11])[5] can be safely converted to
>>>>>> int (* const (*)[11])[4], but only if this dialect also enforces const-
>>>>>> correctness, at least on array pointers.
>>>>>> Anyway, a lot of this changes if we want to use the same concept for
>>>>>> non-local pointers to arrays, because we no longer have an obvious point
>>>>>> of execution at which to evaluate the bounds expression. Instead, we are
>>>>>> forced into re-evaluating it every time we access the variable holding
>>>>>> the array.
>>>>>> Consider:
>>>>>> struct X {
>>>>>>   int count;
>>>>>>   int (*ptr)[count * 10]; // using my preferred syntax
>>>>>> };
>>>>>> 
>>>>>> void test(struct X *xp) {
>>>>>>   // For the purposes of the conversion check here, the
>>>>>>   // source type is int (*)[< xp->count * 10 >], freshly
>>>>>>   // evaluated as part of the member access.
>>>>>>   int (*local)[100] = xp->ptr;
>>>>>> }
>>>>>> 
>>>>>> This has several immediate consequences.
>>>>>> Firstly, we need to already be able to compute the correct bound when we
>>>>>> do the dynamic checks for assignments into this field. For local
>>>>>> variably-modified types, everything in the expression was already in
>>>>>> scope and presumably initialized, so this wasn’t a problem. Here, we’re
>>>>>> not helped by scope, and we are dependent on the count field already
>>>>>> having been initialized.
>>>>>> Secondly, we must be very concerned about anything that could change the
>>>>>> result of this evaluation. So we cannot allow an arbitrary expression;
>>>>>> it must be something that we can fully analyze for what could change it.
>>>>>> And if it refers to variables or fields (which it presumably always
>>>>>> will), we must prevent assignments to those, or at least validate that
>>>>>> any assignments aren’t causing unsound changes to the bound expression.
>>>>>> Thirdly, that concern must apply non-locally: if we allow the address of
>>>>>> the pointer field to be taken (which is totally fine in the local case!),
>>>>>> we can no longer directly reason about mutations through that pointer, so
>>>>>> we have to prevent changes to the bounds variables/fields while the
>>>>>> pointer is outstanding.
>>>>>> And finally, we must be able to recognize combinations of assignments,
>>>>>> because when we’re initializing (or completely rewriting) this structure,
>>>>>> we will need to be able to assign to both count and ptr and not have the
>>>>>> same restrictions in place that we would for separate assignments.
>>>>>> None of this falls out naturally from separate, local language rules; it
>>>>>> all has to be invented for the purpose of serving this dynamic check. And
>>>>>> in fact, -fbounds-safety has to do all of this already just to make
>>>>>> basic checks involving pointers in structs work.
>>>>>> If that can all be established, though, I think the type-conversion-based
>>>>>> approach using variably-modified types has some very nice properties as a
>>>>>> complement to what we’re doing in -fbounds-safety.
>>>>>> For one, it interacts with the -fbounds-safety analysis very cleanly. If
>>>>>> bounds in types are dynamically enforced (which is not true in normal C,
>>>>>> but could be in this dialect), then the type becomes a source of
>>>>>> reliable information for the bounds-safety analysis. Conversely, if
>>>>>> a pointer is converted to a variably-modified type, the analysis done
>>>>>> by -fbounds-safety could be used as an input to the conversion check.
>>>>>> For another, I think it may lead towards a cleaner story for arrays of
>>>>>> pointers to arrays than -fbounds-safety can achieve today, as long as
>>>>>> the inner arrays are of uniform length.
>>>>>> But ultimately, I think it’s still at best a complement to the attributes
>>>>>> we need for -fbounds-safety.
>>>>>> John.
>>>>> 
>>>> 
>>> 
>> 
>> Yeoul

Thanks,
Yeoul
