On Fri, Jul 16, 2021 at 11:12 AM Matthias Kretz <m.kr...@gsi.de> wrote:

> On Friday, 16 July 2021 04:41:17 CEST Jason Merrill via Gcc-patches wrote:
> > > Currently the patch does not adjust the values based on -march, as in
> JF's
> > > proposal.  I'll need more guidance from the ARM/AArch64 maintainers
> about
> > > how to go about that.  --param l1-cache-line-size is set based on
> -mtune,
> > > but I don't think we want -mtune to change these ABI-affecting values.
> > > Are
> > > there -march values for which a smaller range than 64-256 makes sense?
>
> As a user who cares about ABI but also cares about maximizing performance
> of
> builds for a specific HPC setup I'd expect the hardware interference size
> values to be allowed to break ABIs. The point of these values is to give
> me
> better performance portability (but not necessarily binary portability)
> than
> my usual "pick 64 as a good average".


> Wrt, -march / -mtune setting hardware interference size: IMO -mtune=X
> should
> be interpreted as "my binary is supposed to be optimized for X, I accept
> inefficiencies on everything that's not X".
>
> On Friday, 16 July 2021 04:48:52 CEST Noah Goldstein wrote:
> > On intel x86 systems with a private L2 cache the spatial prefetcher
> > can cause destructive interference along 128 byte aligned boundaries.
> >
> https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-3
> > 2-architectures-optimization-manual.pdf#page=60
>
> I don't understand how this feature would lead to false sharing. But maybe
> I
> misunderstand the spatial prefetcher. The first access to one of the two
> cache
> lines pairs would bring both cache lines to LLC (and possibly L2). If a
> core
> with a different L2 reads the other cache line the cache line would be
> duplicated; if it writes to it, it would be exclusive to the other core's
> L2.
> The cache line pairs do not affect each other anymore. Maybe there's a
> minor
> inefficiency on initial transfer from memory, but isn't that all?
>

If two cores that do not share an L2 cache need exclusive access to
a cache-line, the L2 spatial prefetcher could cause pingponging if those
two cache-lines were adjacent and shared the same 128 byte alignment.
Say core A requests line x1 in exclusive, it also get line x2 (not sure
if x2 would be in shared or exclusive), core B then requests x2 in
exclusive,
it also gets x1. Irrelevant of the state x1 comes into core B's private L2
cache
it invalidates the exclusive state on cache-line x1 in core A's private L2
cache. If this was done in a loop (say a simple `lock add` loop) it would
cause
pingponging on cache-lines x1/x2 between core A and B's private L2 caches.


>
> That said. Intel documents the spatial prefetcher exclusively for Sandy
> Bridge. So if you still believe 128 is necessary, set the destructive
> hardware
> interference size to 64 for all of x86 except -mtune=sandybridge.
>

AFAIK the spatial prefetcher exists on newer x86_64 machines as well.


>
> --
> ──────────────────────────────────────────────────────────────────────────
>  Dr. Matthias Kretz                           https://mattkretz.github.io
>  GSI Helmholtz Centre for Heavy Ion Research               https://gsi.de
>  std::experimental::simd              https://github.com/VcDevel/std-simd
> ──────────────────────────────────────────────────────────────────────────
>

Reply via email to