On Fri, Jul 16, 2021 at 11:12 AM Matthias Kretz <m.kr...@gsi.de> wrote:
> On Friday, 16 July 2021 04:41:17 CEST Jason Merrill via Gcc-patches wrote:
> > > Currently the patch does not adjust the values based on -march, as in JF's
> > > proposal. I'll need more guidance from the ARM/AArch64 maintainers about
> > > how to go about that. --param l1-cache-line-size is set based on -mtune,
> > > but I don't think we want -mtune to change these ABI-affecting values.
> > > Are there -march values for which a smaller range than 64-256 makes sense?
>
> As a user who cares about ABI but also cares about maximizing performance of
> builds for a specific HPC setup I'd expect the hardware interference size
> values to be allowed to break ABIs. The point of these values is to give me
> better performance portability (but not necessarily binary portability) than
> my usual "pick 64 as a good average".
>
> Wrt, -march / -mtune setting hardware interference size: IMO -mtune=X should
> be interpreted as "my binary is supposed to be optimized for X, I accept
> inefficiencies on everything that's not X".
>
> On Friday, 16 July 2021 04:48:52 CEST Noah Goldstein wrote:
> > On intel x86 systems with a private L2 cache the spatial prefetcher
> > can cause destructive interference along 128 byte aligned boundaries.
> >
> > https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf#page=60
>
> I don't understand how this feature would lead to false sharing. But maybe I
> misunderstand the spatial prefetcher. The first access to one of the two cache
> lines pairs would bring both cache lines to LLC (and possibly L2). If a core
> with a different L2 reads the other cache line the cache line would be
> duplicated; if it writes to it, it would be exclusive to the other core's L2.
> The cache line pairs do not affect each other anymore. Maybe there's a minor
> inefficiency on initial transfer from memory, but isn't that all?

If two cores that do not share an L2 cache each need exclusive access to a
cache line, the L2 spatial prefetcher could cause ping-ponging if those two
cache lines are adjacent and share the same 128-byte alignment. Say core A
requests line x1 in exclusive state; it also gets line x2 (not sure whether x2
would be in shared or exclusive state). Core B then requests x2 in exclusive
state and also gets x1. Regardless of the state in which x1 arrives in core B's
private L2 cache, it invalidates the exclusive state of cache line x1 in core
A's private L2 cache. If this is done in a loop (say a simple `lock add` loop),
it would cause ping-ponging of cache lines x1/x2 between core A's and core B's
private L2 caches.

> That said. Intel documents the spatial prefetcher exclusively for Sandy
> Bridge. So if you still believe 128 is necessary, set the destructive hardware
> interference size to 64 for all of x86 except -mtune=sandybridge.

AFAIK the spatial prefetcher exists on newer x86_64 machines as well.

> --
> ──────────────────────────────────────────────────────────────────────────
> Dr. Matthias Kretz https://mattkretz.github.io
> GSI Helmholtz Centre for Heavy Ion Research https://gsi.de
> std::experimental::simd https://github.com/VcDevel/std-simd
> ──────────────────────────────────────────────────────────────────────────
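
To make the x1/x2 scenario concrete, here is a minimal C++ sketch of the two layouts under discussion. It is only an illustration, not code from the patch: the constants `kLine` (64) and `kPair` (128), the struct names, and the helper `same_block` are all hypothetical, and the 128-byte pairing is assumed from the prefetcher behaviour described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed sizes for illustration only: a 64-byte cache line, and the
// 128-byte aligned pair of lines the L2 spatial prefetcher is said to fetch.
constexpr std::size_t kLine = 64;
constexpr std::size_t kPair = 128;

// Hypothetical helper: do two addresses fall in the same aligned block?
inline bool same_block(const void* a, const void* b, std::size_t block) {
    return reinterpret_cast<std::uintptr_t>(a) / block ==
           reinterpret_cast<std::uintptr_t>(b) / block;
}

// x1 and x2 sit on different 64-byte lines but inside one 128-byte pair:
// the layout that could ping-pong if the prefetcher pulls both lines.
struct alignas(kPair) LinePadded {
    std::atomic<std::uint64_t> x1{0};
    alignas(kLine) std::atomic<std::uint64_t> x2{0};
};

// x1 and x2 each start their own 128-byte block, so they can never be the
// two halves of one prefetched pair.
struct PairPadded {
    alignas(kPair) std::atomic<std::uint64_t> x1{0};
    alignas(kPair) std::atomic<std::uint64_t> x2{0};
};

static_assert(sizeof(LinePadded) == kPair, "both counters fit in one pair");
static_assert(sizeof(PairPadded) == 2 * kPair, "one pair per counter");

// Runtime check of the two layouts on actual objects.
inline void check_layouts() {
    LinePadded lp;
    PairPadded pp;
    assert(!same_block(&lp.x1, &lp.x2, kLine));  // separate cache lines...
    assert(same_block(&lp.x1, &lp.x2, kPair));   // ...but the same pair
    assert(!same_block(&pp.x1, &pp.x2, kPair));  // separate pairs
}
```

With core A looping on `x1.fetch_add(1)` and core B on `x2.fetch_add(1)`, the `LinePadded` layout is the one that would be exposed to the effect described above, if the prefetcher behaves as assumed; `PairPadded` is what a destructive interference size of 128 would give you.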