On Sun, Nov 02, 2025 at 04:18:48PM +0000, Philipp Stanner wrote:
> Hi Paul and all,
>
> I've read through Appendix C "Why Memory Barriers". It helps greatly
> with understanding the overall problem. However, one part in particular
> confused me and seemed to contradict the previous subsections:
>
> "However, the CPU need not actually invalidate the cache line before
> sending the acknowledgement." [1]
>
> Well yes, I think it absolutely needs to. The previous examples relied
> precisely on this. What a CPU sending an Invalidate Message actually is
> saying is: "I will modify this cache line that you currently have read-
> only in your local cache. Once you have sent me the Invalidate-ACK, I
> know that you have invalidated it and I can safely modify it."
> that you have invalidated it and I can safely modify it."
>
> A CPU sending an Invalidate-ACK without actually having invalidated its
> cache line is, bluntly, lying and endangering the entire cache
> coherence.
>
> Now don't get me wrong, I accept that this is obviously what is really
> happening. But the chapter got me to the point of interpreting a
> truthful Invalidate-ACK as an essential part of cache coherence.
The following sentence was intended to help: "It could instead queue
the invalidate message with the understanding that the message will
be processed before the CPU sends any further messages regarding that
cache line."
But to your point, I can see where this paragraph could use some
improvement. First, this is a hardware optimization and must be done
carefully. Second, it is possible that the code (executing on the CPU
receiving the invalidation message) will execute some memory-ordering
instruction that absolutely requires that the cache line be invalidated
at that point. Alternatively, it is possible that early sending of the
invalidation acknowledgement will require that further execution on that
CPU be speculative. Third, building on the second point, hardware can
violate the rules as long as software running on that hardware cannot
tell the difference. Fourth, you are right that strict unoptimized MESI
would absolutely require that the cache line be invalidated prior to
acknowledging the invalidation.
> The previous section detailing the store-buffer, on the contrary, makes
> more sense: "Although not owning this cache line yet, I can store my new
> value in the store buffer already because whatever the current value
> is, I will overwrite it anyways." whereas with the invalidate queue the
> reader just ignores that the variable might have changed.
Well, if the cacheline is in Modified or Exclusive state, then the
CPU must transition it to at least Shared (with extra state saying
"doomed" or some such). Or not, given yet more protocol complexity.
If the CPU receiving the invalidation request knows that the CPU sending
that request doesn't care what the current value of the cacheline is,
then the receiving CPU can pretend that any stores happened before it
received the invalidation request. Again, assuming that there are no
ordering instructions that prohibit this.
> I guess this is legal because the only real guarantee of CPUs is that
> one particular CPU sees all its accesses in order? But even then, as
> above, for store buffers it makes sense, because the storing CPU
> doesn't care about other values. The *reading* CPU sending the fake
> Invalidate-ACK, on the contrary, should very well care about reading
> the truthful value from the cache line.
Also, different types of CPUs have different underlying ordering
guarantees. And speculative execution can often ignore those guarantees,
as long as no violation ever reaches user-visible state. And given
multiple CPUs reading and modifying a given variable concurrently, what
exactly is the truthful value at any given point in time? (Referring to
Figure 15.10, "A Variable With More Simultaneous Values".)
> And if it all works like that, then what even is the point of
> Invalidate messages at all, if you can not rely on them being followed
> before you yourself start modifying the cache line?
Because they are needed for things like memory-ordering instructions
to work correctly. But on a weakly ordered system, if there are no
memory-ordering instructions in the code, then there are precious few
memory-ordering guarantees anyway. ;-)
> Or is the point that a CPU temporarily ignoring an Invalidate message
> can still validly (without memory barriers) use data in that cache line
> which does *not* get modified by the other CPU? So memory barriers in
> this scenario would allow for more efficiency by "segmenting" cache
> lines?
The point is mostly that on weakly ordered systems in the absence of
memory-ordering instructions, there are very few guarantees. See again
Figure 15.10.
> Quite confusing. Parallel programming is hard and discussing it is one
> thing we can do about it :]
Agreed!!!
Do the explanations above help? If so, I will rework that paragraph
with attribution.
Thanx, Paul
> Thanks,
> Philipp
>
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git/tree/appendix/whymb/whymemorybarriers.tex#n1127
>