Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

James Greenhalgh Tue, 15 Mar 2016 08:32:11 -0700

On Mon, Mar 07, 2016 at 10:54:25PM -0800, Andrew Pinski wrote:
> On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix) <[email protected]> wrote:
> >> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix) <[email protected]> 
> >> wrote:
> >> > Hi,
> >> >
> >> >     As discussed in LKML:
> >> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html,
> >>  the
> >> cost of changing a cache line
> >> >     from shared to exclusive state can be significant on aarch64 cores,
> >> especially when this is triggered by an exclusive store, since it may
> >> >     result in having to retry the transaction.
> >> >     This patch makes use of the "prfm PSTL1STRM" instruction to prefetch
> >> cache lines for write prior to ldxr/stxr loops generated by the ll/sc 
> >> atomic
> >> routines.
> >> >     Bootstrapped on AArch64 server, is it OK?
> >>
> >>
> >> I don't think this is a good thing in general.  For an example on 
> >> ThunderX, the
> >> prefetch just adds a cycle for no benefit.  This really depends on the
> >> micro-architecture of the core and how LDXR/STXR are
> >> implemented.   So after this patch, it will slow down ThunderX.
> >>
> >> Thanks,
> >> Andrew Pinski
> >>
> >
> > Hi Andrew,
> >
> >    I am not quite clear about the ThunderX micro-arch.  But, Yes, I agree
> >    it depends on the micro-architecture of the core.  As the mentioned
> >    kernel patch is merged upstream, I think the added prefetch instruction
> >    in atomic routines is good for most of AArch64 cores in the market.  If
> >    it does nothing good for ThunderX, then how about adding some checking
> >    here?  I mean disabling the the generation of the prfm if we are tuning
> >    for ThunderX.
> 
> No it is not just not do any good, it actually causes worse
> performance for ThunderX.  How about only doing it for the
> micro-architecture where it helps and also not do it for generic since
> it hurts ThunderX so much.


This should be a GCC 7 patch at this point, which should give us some time
to talk through whether we want this patch or not.

How bad is this for ThunderX - upthread you said one cycle penalty, but here
you suggest it hurts ThunderX more? Note that the prefetch is outside of
the LDXR/STXR loop.

Thanks,
James

Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

Reply via email to