http://msmvps.com/blogs/kernelmustard/archive/2004/09/20/13836.aspx

Memory Barriers Wrap-up

Hello blogosphere! I hope everyone had a great time this weekend puzzling through the mysteries of memory barriers. Personally, I spent the weekend coding and reading about relativity (a recent post by Raymond Chen got me re-re-re-re-re-started on physics again).

In addition to the above-mentioned nonsense, I got some time to drag out the Intel manuals to see what they had to say about x86 memory barriers. For the curious, the details can be found in section 7.3 of the 3rd volume of the Intel Pentium 4 manuals.

The situation is slightly different between the {i486, P5} and P6+ (Pentium Pro, Pentium II, Xeon, etc.) processors. The first group of chips enforces relatively strong program ordering of reads and writes at all times, with one exception: read misses are allowed to go ahead of write hits. In other words, if a program writes to memory location 1 and then reads from memory location 2, the read is allowed to hit the system bus before the write. This is because the execution stream inside the processor is usually totally blocked waiting for reads, whereas writes can be "queued" to the cache somewhat more asynchronously in the core without blocking program flow.
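This "read ahead of write" exception is exactly what the classic store-buffering litmus test exposes. The sketch below is my own illustration, not from the Intel manuals: two threads each write one flag and then read the other's, using C11 relaxed atomics so the compiler and CPU are both free to reorder. The interesting outcome is r1 == 0 && r2 == 0, which can only happen if at least one thread's read passed its earlier write.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Two shared flags, plus the value each thread observed. */
static atomic_int x, y;
static int r1, r2;

static void *thread_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed); /* write location 1 */
    r1 = atomic_load_explicit(&y, memory_order_relaxed); /* read location 2 */
    return 0;
}

static void *thread_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed); /* write location 2 */
    r2 = atomic_load_explicit(&x, memory_order_relaxed); /* read location 1 */
    return 0;
}

/* Run one round of the litmus test and report what each thread saw.
   Seeing *out_r1 == 0 && *out_r2 == 0 means a read overtook a write. */
void run_litmus(int *out_r1, int *out_r2) {
    pthread_t a, b;
    atomic_store(&x, 0);
    atomic_store(&y, 0);
    pthread_create(&a, 0, thread_a, 0);
    pthread_create(&b, 0, thread_b, 0);
    pthread_join(a, 0);
    pthread_join(b, 0);
    *out_r1 = r1;
    *out_r2 = r2;
}
```

Any single run may or may not show the reordering - it's a race by design - but over many iterations on real x86 hardware the (0, 0) outcome does turn up.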

The P6-based processors present a slightly different story, adding support for out-of-order writes of long string data and speculative read support. In order to control these features of the processor, Intel has supplied a few instructions to enforce memory ordering. There are three explicit fence instructions - LFENCE, SFENCE, and MFENCE.

  • LFENCE - Load fence - all load operations issued before the LFENCE must complete before it does, and no later load may begin until it completes
  • SFENCE - Store fence - all store operations issued before the SFENCE must become globally visible before any store issued after it
  • MFENCE - Memory fence - all loads and stores issued before the MFENCE must complete before any load or store issued after it
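If you'd rather not drop to assembly, C11's atomic_thread_fence() is the portable analogue of these instructions - on x86, compilers typically lower a seq_cst fence to MFENCE, while acquire and release fences usually cost no instruction at all thanks to the strong ordering described above. Here's a minimal publish/consume sketch (my own example, not from the manuals) showing where each fence goes:

```c
#include <stdatomic.h>

static atomic_int data;
static atomic_int ready;

/* Publisher: write the payload, fence, then raise the flag, so the
   flag can never become visible before the payload. */
void publish(int value) {
    atomic_store_explicit(&data, value, memory_order_relaxed);
    atomic_thread_fence(memory_order_release); /* store-ordering fence */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer: poll the flag, fence, then read the payload, so the
   payload read can never be satisfied ahead of the flag check.
   Returns -1 if the payload isn't published yet. */
int consume(void) {
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        return -1;
    atomic_thread_fence(memory_order_acquire); /* load-ordering fence */
    return atomic_load_explicit(&data, memory_order_relaxed);
}
```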

These instructions are in addition to the "synchronizing" instructions, such as interlocked memory operations and the CPUID instruction. The latter cause a total pipeline flush, leading to less-efficient utilization of the CPU. It should be noted that the DDK defines KeMemoryBarrier() using an interlocked store operation, so KeMemoryBarrier() suffers from this performance issue.
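To make the trade-off concrete, here's a sketch of a full barrier built the same way - an interlocked operation on a throwaway location. This mimics the idea behind KeMemoryBarrier(), not its exact DDK definition; on x86 the atomic exchange compiles to a LOCK-prefixed XCHG, which orders everything but drains the pipeline in the process:

```c
#include <stdatomic.h>

/* Dummy location; the interlocked operation on it provides the
   barrier - the value stored is irrelevant. */
static atomic_long barrier_dummy;

/* Full memory barrier via an interlocked exchange. The LOCK-prefixed
   instruction this compiles to on x86 forces all earlier loads and
   stores to complete before any later ones, at the cost of the
   pipeline flush described above. */
void my_memory_barrier(void) {
    atomic_exchange_explicit(&barrier_dummy, 0, memory_order_seq_cst);
}
```

A plain MFENCE is generally the cheaper choice when all you need is ordering and not an atomic read-modify-write.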

This story changes on other architectures, as I've said before, so the best practice is still to code defensively and use memory barriers where you need them. However, it doesn't look like you're likely to run into these situations in x86-land.
