On Mon, Mar 07, 2016 at 10:54:25PM -0800, Andrew Pinski wrote: > On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix) <felix.y...@huawei.com> wrote: > >> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix) <felix.y...@huawei.com> > >> wrote: > >> > Hi, > >> > > >> > As discussed in LKML: > >> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, > >> the > >> cost of changing a cache line > >> > from shared to exclusive state can be significant on aarch64 cores, > >> especially when this is triggered by an exclusive store, since it may > >> > result in having to retry the transaction. > >> > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch > >> cache lines for write prior to ldxr/stxr loops generated by the ll/sc > >> atomic > >> routines. > >> > Bootstrapped on AArch64 server, is it OK? > >> > >> > >> I don't think this is a good thing in general. For an example on > >> ThunderX, the > >> prefetch just adds a cycle for no benefit. This really depends on the > >> micro-architecture of the core and how LDXR/STXR are > >> implemented. So after this patch, it will slow down ThunderX. > >> > >> Thanks, > >> Andrew Pinski > >> > > > > Hi Andrew, > > > > I am not quite clear about the ThunderX micro-arch. But, Yes, I agree > > it depends on the micro-architecture of the core. As the mentioned > > kernel patch is merged upstream, I think the added prefetch instruction > > in atomic routines is good for most of AArch64 cores in the market. If > > it does nothing good for ThunderX, then how about adding some checking > > here? I mean disabling the the generation of the prfm if we are tuning > > for ThunderX. > > No it is not just not do any good, it actually causes worse > performance for ThunderX. How about only doing it for the > micro-architecture where it helps and also not do it for generic since > it hurts ThunderX so much.
This should be a GCC 7 patch at this point, which should give us some time to talk through whether we want this patch or not. How bad is this for ThunderX - upthread you said one cycle penalty, but here you suggest it hurts ThunderX more? Note that the prefetch is outside of the LDXR/STXR loop. Thanks, James