melver wrote: > > I had to revert because it causes Linux kernel build errors (false > > positives), so I need to take another look. > > Could you provide some details? Was a variable being reused for something > else?
The short version is: a variable was being switched out, and the scheduler code marked this "unsafe" pattern as a fake re-allocation using the `unsafe_alias(void **)` marker. More details and how I came to the current proposal: We started generating false positives in this function with alias invalidation removed: - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c?h=v7.0#n2538 - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c?h=v7.0#n8098 Which looks like this: ``` CC kernel/sched/core.o kernel/sched/core.c:2617:2: error: releasing raw_spinlock 'rq_lockp(rq)' that was not held [-Werror,-Wthread-safety-analysis] 2617 | rq_unlock(rq, &rf); | ^ kernel/sched/core.c:2624:1: error: raw_spinlock 'rq_lockp(__this_rq())' is not held on every path through here [-Werror,-Wthread-safety-analysis] 2624 | } | ^ kernel/sched/core.c:2540:2: note: raw_spinlock acquired here 2540 | rq_lock(rq, &rf); | ^ kernel/sched/core.c:8102:3: error: releasing raw_spinlock 'rq_lockp(rq)' that was not held [-Werror,-Wthread-safety-analysis] 8102 | rq_unlock(rq, &rf); | ^ kernel/sched/core.c:8108:1: error: raw_spinlock 'rq_lockp(__this_rq())' is still held at the end of function [-Werror,-Wthread-safety-analysis] 8108 | } | ^ kernel/sched/core.c:8098:3: note: raw_spinlock acquired here 8098 | rq_lock(rq, &rf); | ^ 4 errors generated. ``` With alias analysis, ThreadSafetyAnalysis correctly detected that `rq` was being conditionally switched out; because the switch was conditional, at the branch merge-point, `rq` was also being invalidated: ``` if (task_rq(p) == rq && task_on_rq_queued(p)) rq = __migrate_task(rq, &rf, p, cpu); ``` To avoid `no_thread_safety_analysis` for the whole function, we use the `unsafe_alias(void **)` helper, along with a descriptive comment why it's safe. That helper invalidates the alias before first use, so that the analysis uses just 'rq' and 'rq_lockp(rq)' as the lock being locked/unlocked; after that the inner reassignment doesn't matter. While this pattern is a little more complex than "just reallocation", I'd like to detect it and provide an explicit escape hatch that forces documenting it. The question I asked myself is if (1) we don't want to blanket invalidate aliases passed by mutable reference/pointer, (2) are there idioms which are very likely unsafe that we should detect? Which led me to "type-erased alias passing" and the proposal in https://github.com/llvm/llvm-project/pull/190154 - more below... > > In particular, I didn't like that we effectively undid what > > testPointerAliasEscapeMultiple() was testing (avoid false negatives for > > this case, which @ziqingluo-90 fixed earlier). > > I actually liked it. If we're doing alias analysis, then for all we know, `F` > and `Fp` point to the same thing. This call invalidation will work in some > examples and in others it will fail, and I don't think we'll ever get > everything right. I think we have to keep this limited and predictable. Agree on predictability. The goal is to replace the old heuristics with a much simpler rule grounded in type semantics that we can use as a strong signal of identity mutation. If we don't invalidate aliases when they are explicitly passed as a different type (usually `void**`), we're making the unsafe assumption that the underlying object and capability/lock state is "compatible", even though the code just went out of its way to treat it differently. > > And I found real cases in the Linux kernel where there's some reallocation > > of an object going on that I _do_ want to catch (or suppress explicitly). > > We can't catch everything, and especially tracking values or variables is a > bit out-of-scope for us. While general value tracking is out of scope, maintaining the integrity of capability/lock identities is at the core of TSA: I'd argue we're not tracking arbitrary values, but those that are important to diagnose correct/incorrect synchronization. > It might be interesting though it the alias analysis is what causes the false > negative, i.e. if there would be a warning without it. There aren't false negatives, but IMHO simply missed opportunities to warn of unsafe synchronization patterns (which is in scope for us). Specifically, upon reallocation of an object, the old TSA and TSA with alias analysis that would skip invalidation would not be able to detect bugs such as: ``` struct foo { struct mutex mtx; size_t size; char variable_buffer[]; // trailing buffer }; struct foo *v = make_initial_foo(); // ... start some threads that are passed 'v' ... lock(&v->mtx); // ... do stuff ... expand_size = ... more space ... // missing unlock - generic realloction + copy would be a data race realloc_array((void **)&v, v->size + expand_size); // ... do more stuff ... unlock(&v->mtx); ``` This is the kind of pattern that we can cheaply help detect, and is in scope because it's broken synchronization. And back to the Linux kernel: the above trailing buffer pattern is very very common. By invalidating aliases that are passed type-erased we detect this narrow case, and at the same time we get the `unsafe_alias()` escape hatch back for deliberately complex patterns like conditional alias reassignments. https://github.com/llvm/llvm-project/pull/187691 _______________________________________________ cfe-commits mailing list [email protected] https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
