melver wrote:

> > I had to revert because it causes Linux kernel build errors (false 
> > positives), so I need to take another look.
> 
> Could you provide some details? Was a variable being reused for something 
> else?

The short version is: a variable was being switched out, and the scheduler code 
marked this "unsafe" pattern as a fake re-allocation using the 
`unsafe_alias(void **)` marker.

More details and how I came to the current proposal:

We started generating false positives in this function with alias invalidation 
removed:
- 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c?h=v7.0#n2538
- 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c?h=v7.0#n8098

Which looks like this:
```
  CC      kernel/sched/core.o
kernel/sched/core.c:2617:2: error: releasing raw_spinlock 'rq_lockp(rq)' that 
was not held [-Werror,-Wthread-safety-analysis]
 2617 |         rq_unlock(rq, &rf);
      |         ^
kernel/sched/core.c:2624:1: error: raw_spinlock 'rq_lockp(__this_rq())' is not 
held on every path through here [-Werror,-Wthread-safety-analysis]
 2624 | }
      | ^
kernel/sched/core.c:2540:2: note: raw_spinlock acquired here
 2540 |         rq_lock(rq, &rf);
      |         ^
kernel/sched/core.c:8102:3: error: releasing raw_spinlock 'rq_lockp(rq)' that 
was not held [-Werror,-Wthread-safety-analysis]
 8102 |                 rq_unlock(rq, &rf);
      |                 ^
kernel/sched/core.c:8108:1: error: raw_spinlock 'rq_lockp(__this_rq())' is 
still held at the end of function [-Werror,-Wthread-safety-analysis]
 8108 | }
      | ^
kernel/sched/core.c:8098:3: note: raw_spinlock acquired here
 8098 |                 rq_lock(rq, &rf);
      |                 ^
4 errors generated.
```

With alias analysis, ThreadSafetyAnalysis correctly detected that `rq` was 
being conditionally switched out; because the switch was conditional, at the 
branch merge-point, `rq` was also being invalidated:

```
                if (task_rq(p) == rq && task_on_rq_queued(p))
                        rq = __migrate_task(rq, &rf, p, cpu);
```

To avoid `no_thread_safety_analysis` for the whole function, we use the 
`unsafe_alias(void **)` helper, along with a descriptive comment why it's safe. 
That helper invalidates the alias before first use, so that the analysis uses 
just 'rq' and 'rq_lockp(rq)' as the lock being locked/unlocked; after that the 
inner reassignment doesn't matter.

While this pattern is a little more complex than "just reallocation", I'd like 
to detect it and provide an explicit escape hatch that forces documenting it.

The question I asked myself is if (1) we don't want to blanket invalidate 
aliases passed by mutable reference/pointer, (2) are there idioms which are 
very likely unsafe that we should detect? Which led me to "type-erased alias 
passing" and the proposal in https://github.com/llvm/llvm-project/pull/190154 - 
more below...

> > In particular, I didn't like that we effectively undid what 
> > testPointerAliasEscapeMultiple() was testing (avoid false negatives for 
> > this case, which @ziqingluo-90 fixed earlier).
> 
> I actually liked it. If we're doing alias analysis, then for all we know, `F` 
> and `Fp` point to the same thing. This call invalidation will work in some 
> examples and in others it will fail, and I don't think we'll ever get 
> everything right. I think we have to keep this limited and predictable.

Agree on predictability. The goal is to replace the old heuristics with a much 
simpler rule grounded in type semantics that we can use as a strong signal of 
identity mutation.

If we don't invalidate aliases when they are explicitly passed as a different 
type (usually `void**`), we're making the unsafe assumption that the underlying 
object and capability/lock state is "compatible", even though the code just 
went out of its way to treat it differently.

> > And I found real cases in the Linux kernel where there's some reallocation 
> > of an object going on that I _do_ want to catch (or suppress explicitly).
> 
> We can't catch everything, and especially tracking values or variables is a 
> bit out-of-scope for us.

While general value tracking is out of scope, maintaining the integrity of 
capability/lock identities is at the core of TSA: I'd argue we're not tracking 
arbitrary values, but those that are important to diagnose correct/incorrect 
synchronization.

> It might be interesting though it the alias analysis is what causes the false 
> negative, i.e. if there would be a warning without it.

There aren't false negatives, but IMHO simply missed opportunities to warn of 
unsafe synchronization patterns (which is in scope for us).

Specifically, upon reallocation of an object, the old TSA and TSA with alias 
analysis that would skip invalidation would not be able to detect bugs such as:
```
struct foo {
   struct mutex mtx;
   size_t size;
   char variable_buffer[];  // trailing buffer
};

struct foo *v = make_initial_foo();
// ... start some threads that are passed 'v' ...
lock(&v->mtx);
// ... do stuff  ...
expand_size = ... more space ...
// missing unlock - generic realloction + copy would be a data race
realloc_array((void **)&v, v->size + expand_size);
// ... do more stuff ...
unlock(&v->mtx);
```

This is the kind of pattern that we can cheaply help detect, and is in scope 
because it's broken synchronization.

And back to the Linux kernel: the above trailing buffer pattern is very very 
common. By invalidating aliases that are passed type-erased we detect this 
narrow case, and at the same time we get the `unsafe_alias()` escape hatch back 
for deliberately complex patterns like conditional alias reassignments.


https://github.com/llvm/llvm-project/pull/187691
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to