Re: [RFC] [C]New syntax for the argument of counted_by attribute for C language

Yeoul Na Fri, 28 Mar 2025 06:05:39 -0700


> On Mar 28, 2025, at 5:51 AM, Yeoul Na <yeoul...@apple.com> wrote:
> 
> 
> 
>> On Mar 27, 2025, at 9:17 AM, Qing Zhao <qing.z...@oracle.com> wrote:
>> 
>> Yeoul,
>> 
>> Thanks for the writeup.
>> 
>> So, basically, this writeup insists on introducing a new “structure scope” 
>> (similar to the instance scope in C++) into the C language ONLY for the 
>> counted_by attribute:
>> 
>> 1. Inside counted_by attribute, the name lookup starts:
>> 
>>    A. Inside the current structure first (the NEW structure scope added to 
>> C);
>>    B. Then outside the structure; (other current C scopes, local scope or 
>> global scope)
>> 
>> 2. When trying to reference a variable outside of the structure scope whose 
>>    name conflicts with a structure member, a new builtin function 
>>    “__builtin_global_ref” is introduced for that purpose.
>> 
>>   (I think the name __builtin_global_ref might not be accurate, because the 
>> outer scope might be either global scope or local scope)
> 
> Clarification: __builtin_global_ref will see the global scope directly. This 
> is similar to global scope resolution syntax (‘::’) in C++.
> 
> constexpr int len = 10;
> 
> void foo (void)
> {
>   const int len = 20;
> 
>   struct s {
>     int len;
>     int *__counted_by(__builtin_global_ref(len)) buf; // refers to global ‘len’
>   };
> }
> 
> Here are some reasons why we chose to provide a global scope resolution 
> builtin, not a builtin to see an outer scope or just a local scope:
> 
> 1) The builtin is a substitute for some “scope resolution specifier”. Scope 
> specifiers are typically meant to choose a “specific” scope.
> 2) To the best of my knowledge, there is no precedent in any other C family 
> language for providing scope resolution for local scopes.
> 3) Name conflicts with local variables can be easily resolved by renaming.
> 4) If we provided a builtin that selects the outer scope instead, there would 
> be no way to choose a global ‘len’ if it’s shadowed by a local variable, so 
> the member name would have to be renamed anyway in order to choose the global 
> ‘len’.
> 5) This way, code can be written compatibly both in C and C++.
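
A minimal sketch of reasons 3) and 4) in standard C, independent of the proposal: ordinary lookup always finds the innermost declaration, so a plain `len` used inside a block-scope struct cannot reach a file-scope `len` that a local declaration shadows. Enumeration constants stand in for the `constexpr` objects so the sketch also compiles before C23; `outer_bound` and `inner_bound` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { len = 10 };                 /* file-scope constant, stands in for the global `len` */

/* Element count of an array bounded by `len` when nothing shadows it. */
size_t outer_bound(void)
{
    struct s { int arr[len]; };    /* lookup finds the file-scope constant */
    return sizeof(((struct s *)0)->arr) / sizeof(int);
}

/* Same declaration, but a local constant now shadows the file-scope one. */
size_t inner_bound(void)
{
    enum { len = 20 };             /* shadows the file-scope `len` */
    static_assert(len == 20, "plain lookup sees the innermost len");
    struct s { int arr[len]; };    /* no standard spelling reaches the outer `len` here */
    return sizeof(((struct s *)0)->arr) / sizeof(int);
}
```

Renaming the local constant restores access to the file-scope one, which is the workaround reason 3) relies on.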
> 
>> 
>> 3. When there is a conflict between counted_by and VLA, such as:
>> 
>> constexpr int len = 10;
>> 
>> struct s {
>>  int len;
>>  int *__counted_by(len) buf; // refers to struct member `len`.
>>  int arr[len]; // refers to global constexpr `len`
>> };
>> 
>> Issue a compiler warning asking the user to use __builtin_global_ref 
>> to disambiguate. 
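
A small sketch of the array-bound half of this example as it behaves in standard C today: struct members are not in scope inside the member list, so the bound of `arr` resolves to the file-scope constant even though a member with the same name precedes it. An enumeration constant replaces the `constexpr` object so the code compiles before C23, the `counted_by` annotation is left out because its lookup rule is the open question, and `arr_elems` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

enum { len = 10 };            /* stands in for `constexpr int len = 10;` */

struct s {
    int len;                  /* member sharing the name; lives in the member namespace */
    int *buf;                 /* annotation omitted on purpose in this sketch */
    int arr[len];             /* ordinary lookup: the file-scope `len`, not the member */
};

/* The member does not hide the constant, so the bound is 10. */
static_assert(sizeof(((struct s *)0)->arr) == 10 * sizeof(int),
              "array bound comes from the file-scope len");

/* Number of elements in s.arr, computed from the type alone. */
size_t arr_elems(void)
{
    return sizeof(((struct s *)0)->arr) / sizeof(int);
}
```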
> 
> Additionally, our proposal suggests __builtin_member_ref to explicitly use a 
> member in a similar situation.
> The builtin could be replaced by ‘__self' or some other syntax once the 
> standard committee decides in the future, but earlier in the thread JeanHeyd 
> pointed out that:
> 
>       "I would like to gently push back about __self__, or __self, or self, 
> because all of these identifiers are fairly common identifiers in code. When 
> I was writing the paper for __self_func ( 
> https://thephd.dev/_vendor/future_cxx/papers/C%20-%20__self_func.html ), I 
> searched GitHub and other source code indexing and repository services: 
> __self, __self__, and self have a substantial number of uses. If there's an 
> alternative spelling to consider, I think that would be helpful."


That said, once we agree on the right syntax for accessing a member, our 
proposal doesn’t object to introducing it and using it optionally.

> 
> Thus, I think instead of trying to stick to a certain syntax right now, using 
> some builtin will allow us to easily migrate to a new syntax by guarding the 
> current usage under a macro.
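
One way such a macro guard could look, as a sketch only. `GLOBAL_REF`, `COUNTED_BY`, `HAVE_POINTER_COUNTED_BY`, `struct items`, and the helper functions are hypothetical names; `__builtin_global_ref` is the builtin proposed in this thread and is assumed not to exist in current compilers, so the fallback branch is the one expected to be taken. The attribute itself expands to nothing unless the hypothetical opt-in macro is defined.

```c
#include <assert.h>
#include <stddef.h>

#ifndef __has_builtin
#  define __has_builtin(x) 0          /* compilers without the feature-test macro */
#endif

/* Use the proposed builtin only where a compiler reports it. */
#if __has_builtin(__builtin_global_ref)
#  define GLOBAL_REF(name) __builtin_global_ref(name)
#else
#  define GLOBAL_REF(name) (name)
#endif

/* Hypothetical opt-in switch; without it the annotation disappears. */
#ifdef HAVE_POINTER_COUNTED_BY
#  define COUNTED_BY(expr) __attribute__((counted_by(expr)))
#else
#  define COUNTED_BY(expr)
#endif

enum { max_items = 4 };               /* file-scope bound */

struct items {
    int used;
    int *COUNTED_BY(GLOBAL_REF(max_items)) buf;
};

/* Append one value; returns 0 when the buffer is full. */
int items_push(struct items *it, int value)
{
    assert(it->used >= 0);
    if (it->used >= max_items)
        return 0;
    it->buf[it->used++] = value;
    return 1;
}

/* Try ten pushes into a max_items-sized buffer and count the successes. */
int demo_push_count(void)
{
    int storage[max_items];
    struct items it = { 0, storage };
    int pushed = 0;
    for (int i = 0; i < 10; i++)
        pushed += items_push(&it, i);
    return pushed;
}
```

If a different spelling is standardized later, only the two macro definitions would need to change.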
> 
> Writing the builtin could be cumbersome, but it only needs to be written when 
> there is an ambiguity. Btw, I’m open to any other name suggestions for the 
> builtins!
> 
>> 
>> Are the above the correct understanding of your writeup?
> 
> Yes, it’s mostly correct, except for the clarifications I made above. Thank you!
> 
>> 
>> 
>> From my understanding:
>> 
>> 1. This design started from C++’s point of view by adding a new 
>> “structure scope” to C;
>> 2. This design conflicts with the current VLA default scope rule (which 
>> is based on the default C scopes) in C.
>>     In the above example that mixes counted_by and VLA, it’s weird that 
>> there are two different name
>>     lookup rules inside the same structure. 
>>     It’s clearly a design bug. Either VLA or counted_by needs to be fixed to 
>> make them consistent. 
>> 
>> 
>> I personally do not completely object to introducing a new “structure scope” 
>> into C, but it’s hard for me to accept
>> that there are two different name lookup rules inside the same structure: 
>> one rule for VLA, another rule for the counted_by
>> attribute.  (If we introduce a new “structure scope” to C, I think it’s 
>> better to change VLA to “structure scope” too; I am not sure
>> whether this is feasible or not.)
>> 
>> I still think that introducing a new keyword “__self” for referring to a 
>> member variable inside a structure, without adding 
>> a new “structure scope”, would be the best approach to resolve this issue in 
>> C. 
>> 
>> However, I am really hoping that the discussion can converge soon. So, I 
>> am okay with adding a new “structure scope”
>> if most people agree on that approach. 
> 
> Thanks for the flexibility!
> 
>> 
>> Qing
>> 
>> 
>>> On Mar 26, 2025, at 12:59, Yeoul Na <yeoul...@apple.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> Thanks for all the discussions.
>>> 
>>> I posted the design rationale for our current approach in 
>>> https://discourse.llvm.org/t/rfc-forward-referencing-a-struct-member-within-bounds-annotations/85510.
>>>  This clarifies some of the questions that were asked in this thread. The 
>>> document also proposes diagnostics to mitigate potential ambiguity, and 
>>> proposes new builtins that can be used as a suppression and disambiguation 
>>> mechanism.
>>> 
>>> Best regards,
>>> Yeoul
>>> 
>>>> On Mar 26, 2025, at 9:11 AM, Yeoul Na <yeoul...@apple.com> wrote:
>>>> 
>>>> Sorry for the delay.
>>>> 
>>>> I’m planning on sending out our design rationale of the current approach 
>>>> without the new syntax today.
>>>> 
>>>> - Yeoul
>>>> 
>>>>> On Mar 14, 2025, at 9:22 PM, John McCall <rjmcc...@apple.com> wrote:
>>>>> 
>>>>> On 14 Mar 2025, at 15:18, Martin Uecker wrote:
>>>>> Am Freitag, dem 14.03.2025 um 14:42 -0400 schrieb John McCall:
>>>>> On 14 Mar 2025, at 14:13, Martin Uecker wrote:
>>>>> Am Freitag, dem 14.03.2025 um 10:11 -0700 schrieb David Tarditi:
>>>>> Hi Martin,
>>>>> The C design of VLAs misunderstood dependent typing.
>>>>> They probably did not care about theory, but the design is 
>>>>> not inconsistent with theory.
>>>>> This is almost true, but for bad reasons. The theory of dependent types 
>>>>> is heavily concerned with deciding whether two types are the same, and C 
>>>>> simply sidesteps this question because type identity is largely 
>>>>> meaningless in C. Every value of variably-modified type is (or decays to) 
>>>>> a pointer, and all pointers in C freely convert to one another (within 
>>>>> the object/function categories). _Generic is based on type compatibility, 
>>>>> not equality. So in that sense, the standard doesn’t say anything 
>>>>> inconsistent with theory because it doesn’t even try to say anything.
>>>>> The reason it is not quite true is that C does have rules for compatible 
>>>>> and composite types, and alas, those rules for variably-modified types 
>>>>> are not consistent with theory. Two VLA types of compatible element type 
>>>>> are always statically considered compatible, and it’s simply UB if the 
>>>>> sizes aren’t the same. The composite type of a VLA and a fixed-size array 
>>>>> type is always the fixed-size array type. The standard is literally 
>>>>> incomplete about the composite type of two VLAs; if you use a ternary 
>>>>> operator where both operands are casts to VLA types, the standard just 
>>>>> says it’s straight-up just undefined behavior (because one of the types 
>>>>> has a bound that’s unevaluated) and doesn’t even bother telling us what 
>>>>> the static type is supposed to be.
>>>>> Yes, I guess this is all true.
>>>>> But let's rephrase my point a bit more precisely: One could take 
>>>>> a strict subset of C that includes variably modified types but 
>>>>> obviously has to forbid a lot of other things (e.g. arbitrary pointer 
>>>>> conversions or unsafe down-casts and much more) and make this a 
>>>>> memory-safe language with dependent types. This would also 
>>>>> require adding run-time checks at certain places where there 
>>>>> is now UB, in particular where two VLA types need to be compatible.
>>>>> Mmm. You can certainly subset C to the point that it’s memory-safe, but
>>>>> it wouldn’t really be anything like C anymore. As long as C has a heap,
>>>>> I don’t see any path to achieving temporal safety without significant
>>>>> extensions to the language. But if we’re just talking about spatial 
>>>>> safety,
>>>>> then sure, that could be a lot closer to C today.
>>>>> Is that your vision, then, that you’d like to see the same sort of checks
>>>>> that -fbounds-safety does, but you want them based firmly in the language
>>>>> as a dynamic check triggered by pointer type conversion, with bounds
>>>>> specified using variably-modified types? It’s a pretty elegant vision, and
>>>>> I can see the attraction. It has some real merits, which I’ll get to 
>>>>> below.
>>>>> I do see at least two significant challenges, though.
>>>>> The first and biggest problem is that, in general, array bounds can only 
>>>>> be
>>>>> expressed on a pointer value if it’s got pointer to array type. Most C 
>>>>> array
>>>>> code today works primarily with pointers to elements; programmers just use
>>>>> array types to create concrete arrays, and they very rarely use pointers 
>>>>> to
>>>>> array type at all. There are a bunch of reasons for that:
>>>>>    • Pointers to arrays have to be dereferenced twice: (*ptr)[idx] instead
>>>>> of ptr[idx].
>>>>>    • That makes them more error-prone, because it is easy to do pointer
>>>>> arithmetic at the wrong level, e.g. by writing ptr[idx], which will
>>>>> stride by multiples of the entire array size. That may even pass the
>>>>> compiler without complaint because of C’s laxness about conversions.
>>>>>    • Keeping the bound around in the pointer type is more work and 
>>>>> doesn’t do
>>>>> anything useful right now.
>>>>>    • A lot of C programmers dislike nested declarator syntax and can’t 
>>>>> remember
>>>>> how it works. Those of us who can write it off the top of our heads are
>>>>> quite atypical.
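
The first two bullets can be illustrated with a short sketch in standard C. The function names are hypothetical; the fixed bound of 4 is chosen only for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Element idx of the array that ptr points to: note the explicit dereference. */
int elem_via_array_ptr(int (*ptr)[4], size_t idx)
{
    assert(idx < sizeof *ptr / sizeof (*ptr)[0]);   /* the bound is recoverable from the type */
    return (*ptr)[idx];
}

/* Bytes skipped by ptr + 1: one whole array, not one element. */
size_t array_ptr_stride(void)
{
    int row[4] = {0};
    int (*ptr)[4] = &row;
    return (size_t)((char *)(ptr + 1) - (char *)ptr);
}

/* Read the third element of a small array through a pointer to the array. */
int demo_elem(void)
{
    int row[4] = { 10, 20, 30, 40 };
    return elem_via_array_ptr(&row, 2);
}
```

Writing `ptr[2]` instead of `(*ptr)[2]` would name a whole `int[4]` two arrays past `row`, which is the stride mistake described above.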
>>>>> Now, there is an exception: you can write a parameter using an array type,
>>>>> and it actually declares a pointer parameter. You could imagine using this
>>>>> as a syntax for an enforceable array bound for arguments, although the
>>>>> committee did already decide that these bounds were meaningless without
>>>>> static. Unfortunately, you can’t do this in any other position and still
>>>>> end up with just a pointer, so it’s not helpful as a general syntax for
>>>>> associating bounds with pointers.
>>>>> The upshot is that this isn’t really something people can just adopt by
>>>>> adding annotations. It’s not just a significant rewrite, it’s a rewrite 
>>>>> that
>>>>> programmers will have very legitimate objections to. I think that makes 
>>>>> this
>>>>> at best a complement to the “sidecar” approach taken by -fbounds-safety
>>>>> where we can track top-level bounds to a specific pointer value.
>>>>> The second problem is that there are some extralingual problems that
>>>>> -fbounds-safety has to solve around bounds that aren’t just local
>>>>> evaluations of bounds expressions, and a type-conversion-driven approach
>>>>> doesn’t help with any of them.
>>>>> As you mentioned, the design of variably-modified types is based on
>>>>> evaluating the bounds expression at some specific point in the program
>>>>> execution. Since these types can only be written locally, the evaluation
>>>>> point is obvious. If we wanted to dynamically enforce bounds during
>>>>> initialization, it would simply be another use of the same computed bound:
>>>>> int count = ...;
>>>>> int (*ptr)[count * 10] = source_ptr;
>>>>> 
>>>>> Here we would evaluate count * 10 exactly once and use it both as (1) part
>>>>> of the destination type when initializing ptr with source_ptr and (2)
>>>>> part of the type of ptr for all uses of it. For example, if source_ptr
>>>>> were of type int (*)[100], we would dynamically check that
>>>>> count * 10 <= 100. This all works perfectly with an arbitrary bounds
>>>>> expression; it could even contain an opaque function call.
>>>>> Note that we don’t need any special behavior specifically for
>>>>> initialization. If we later assign a new value into ptr, we will still be
>>>>> converting the new value to the type int (*)[< count * 10 >], using the
>>>>> value computed at the time of declaration of the variable. This model 
>>>>> would
>>>>> simply require that conversion to validate the bounds during assignment 
>>>>> just
>>>>> as it would during initialization.
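
A sketch of how this could look if written by hand in today's C, assuming a compiler that supports variably-modified types. The explicit `if` stands in for the hypothetical dynamic check that the conversion would perform; `sum_prefix` and `demo_sum` are names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the first count * 10 elements of a 100-element array, or return -1
   if the requested bound does not fit. */
int sum_prefix(int (*source_ptr)[100], int count)
{
    if (count <= 0 || count > 10)          /* i.e. 0 < count * 10 <= 100 */
        return -1;
    size_t bound = (size_t)count * 10;     /* evaluated exactly once */

    int (*ptr)[bound] = source_ptr;        /* variably-modified pointer type */
    assert(sizeof *ptr == bound * sizeof(int));   /* the type remembers the bound */

    int sum = 0;
    for (size_t i = 0; i < bound; i++)
        sum += (*ptr)[i];
    return sum;
}

/* Fill a 100-element array with ones and sum a prefix of it. */
int demo_sum(int count)
{
    int data[100];
    for (int i = 0; i < 100; i++)
        data[i] = 1;
    return sum_prefix(&data, count);
}
```

In standard C the initialization of `ptr` compiles without any check, because two array types with a variable bound are always treated as compatible; the sketch only shows where an enforced check would sit.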
>>>>> Now, with nested arrays, variance does become a problem. Let’s reduce
>>>>> bounds expressions to their evaluated bounds to make this easier to write.
>>>>>    • int (*)[11] can be converted to int(*)[10] because we’re simply
>>>>> allowing fewer elements to be used.
>>>>>    • By the same token, int (*(*)[11])[5] can be converted to
>>>>> int (*(*)[10])[5]. This is the same logic as the above, just with an
>>>>> element type that happens to be a pointer to array type.
>>>>>    • But int (*(*)[11])[5] cannot be safely converted to int 
>>>>> (*(*)[11])[4],
>>>>> because while it’s safe to read an int (*)[4] from this array, it’s
>>>>> not safe to assign one into it.
>>>>>    • int (* const (*)[11])[5] can be safely converted to
>>>>> int (* const (*)[11])[4], but only if this dialect also enforces const-
>>>>> correctness, at least on array pointers.
>>>>> Anyway, a lot of this changes if we want to use the same concept for
>>>>> non-local pointers to arrays, because we no longer have an obvious point 
>>>>> of
>>>>> execution at which to evaluate the bounds expression. Instead, we are 
>>>>> forced
>>>>> into re-evaluating it every time we access the variable holding the array.
>>>>> Consider:
>>>>> struct X {
>>>>> int count;
>>>>> int (*ptr)[count * 10]; // using my preferred syntax
>>>>> };
>>>>> 
>>>>> void test(struct X *xp) {
>>>>> // For the purposes of the conversion check here, the
>>>>> // source type is int (*)[< xp->count * 10 >], freshly
>>>>> // evaluated as part of the member access.
>>>>> int (*local)[100] = xp->ptr;
>>>>> }
>>>>> 
>>>>> This has several immediate consequences.
>>>>> Firstly, we need to already be able to compute the correct bound when we 
>>>>> do
>>>>> the dynamic checks for assignments into this field. For local variably-
>>>>> modified types, everything in the expression was already in scope and
>>>>> presumably initialized, so this wasn’t a problem. Here, we’re not helped
>>>>> by scope, and we are dependent on the count field already having been
>>>>> initialized.
>>>>> Secondly, we must be very concerned about anything that could change the
>>>>> result of this evaluation. So we cannot allow an arbitrary expression;
>>>>> it must be something that we can fully analyze for what could change it.
>>>>> And if it refers to variables or fields (which it presumably always will), we
>>>>> must prevent assignments to those, or at least validate that any
>>>>> assignments aren’t causing unsound changes to the bound expression.
>>>>> Thirdly, that concern must apply non-locally: if we allow the address of 
>>>>> the
>>>>> pointer field to be taken (which is totally fine in the local case!),
>>>>> we can no longer directly reason about mutations through that pointer, so we
>>>>> have to prevent changes to the bounds variables/fields while the pointer 
>>>>> is
>>>>> outstanding.
>>>>> And finally, we must be able to recognize combinations of assignments,
>>>>> because when we’re initializing (or completely rewriting) this structure,
>>>>> we will need to be able to assign to both count and ptr and not have the
>>>>> same restrictions in place that we would for separate assignments.
>>>>> None of this falls out naturally from separate, local language rules; it
>>>>> all has to be invented for the purpose of serving this dynamic check. And
>>>>> in fact, -fbounds-safety has to do all of this already just to make
>>>>> basic checks involving pointers in structs work.
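
A hand-written emulation of the first and last consequences, as a sketch only: the bound is re-evaluated from the `count` field on every access, and `count` and `ptr` are assigned together so the pair is validated as one unit. This is not how -fbounds-safety is implemented; `x_assign`, `x_at`, and `demo_struct_bounds` are hypothetical helpers, and `ptr` is a plain `int *` because standard C does not accept the member syntax used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct X {
    int count;
    int *ptr;      /* conceptually int (*ptr)[count * 10] */
};

/* Assign count and ptr together, so the pair is checked as one unit. */
bool x_assign(struct X *xp, int *buf, size_t buf_elems, int count)
{
    if (count < 0 || (size_t)count * 10 > buf_elems)
        return false;              /* would make the bound unsound: reject */
    xp->count = count;
    xp->ptr = buf;
    return true;
}

/* Checked access: the bound is re-evaluated from xp->count on every call. */
int *x_at(struct X *xp, size_t idx)
{
    assert(xp->count >= 0);
    size_t bound = (size_t)xp->count * 10;
    return idx < bound ? &xp->ptr[idx] : NULL;
}

/* Counts how many of four expectations hold. */
int demo_struct_bounds(void)
{
    int storage[30] = {0};
    struct X x = { 0, NULL };
    int ok = 0;
    ok += x_assign(&x, storage, 30, 3);        /* 30 <= 30: accepted */
    ok += !x_assign(&x, storage, 30, 4);       /* 40 > 30: rejected, x unchanged */
    ok += x_at(&x, 29) != NULL;                /* inside the bound */
    ok += x_at(&x, 30) == NULL;                /* one past the bound */
    return ok;
}
```

Nothing in the sketch prevents a caller from writing `x.count` directly or taking `&x.ptr`; preventing that is the non-local analysis described above.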
>>>>> If that can all be established, though, I think the type-conversion-based
>>>>> approach using variably-modified types has some very nice properties as a
>>>>> complement to what we’re doing in -fbounds-safety.
>>>>> For one, it interacts with the -fbounds-safety analysis very cleanly. If
>>>>> bounds in types are dynamically enforced (which is not true in normal C,
>>>>> but could be in this dialect), then the type becomes a source of reliable
>>>>> information for the bounds-safety analysis. Conversely, if
>>>>> a pointer is converted to a variably-modified type, the analysis done
>>>>> by -fbounds-safety could be used as an input to the conversion check.
>>>>> For another, I think it may lead towards a cleaner story for arrays of
>>>>> pointers to arrays than -fbounds-safety can achieve today, as long as
>>>>> the inner arrays are of uniform length.
>>>>> But ultimately, I think it’s still at best a complement to the attributes
>>>>> we need for -fbounds-safety.
>>>>> John.
>>>> 
>>> 
>> 
> 
> Yeoul
