https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79784
--- Comment #10 from Chen Baozi <cbz at baozis dot org> ---
I have attached the testcase I used to benchmark OpenMP synchronization on AArch64; it is extracted from the EPCC OpenMP micro-benchmark suite. The operating system is Ubuntu 16.04 with a 4.4.0 kernel. The hardware is an experimental 16-core AArch64 platform: 4 clusters of 4 CPU cores each, interconnected via the L3 cache. The thrashing seems to be more severe when the threads are placed in one cluster, e.g., 4 threads distributed across 4 different clusters perform much better than 4 threads placed in a single cluster.
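
For anyone trying to reproduce the two placements compared above, here is a minimal sketch of a helper that builds the core lists. It is not part of the attached testcase. It assumes core IDs are numbered cluster by cluster (cores 0-3 in cluster 0, 4-7 in cluster 1, and so on), which may not match the real platform; the names `packed_core`, `spread_core` and `affinity_list` are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Topology from the comment: 4 clusters of 4 cores each. */
enum { CLUSTERS = 4, CORES_PER_CLUSTER = 4 };

/* ASSUMPTION: core IDs are contiguous within a cluster (0-3, 4-7, ...). */

/* Core for a thread when all threads are packed into one cluster. */
static int packed_core(int thread, int cluster)
{
    return cluster * CORES_PER_CLUSTER + thread % CORES_PER_CLUSTER;
}

/* Core for a thread when threads are spread round-robin over clusters. */
static int spread_core(int thread)
{
    return (thread % CLUSTERS) * CORES_PER_CLUSTER
         + (thread / CLUSTERS) % CORES_PER_CLUSTER;
}

/* Write a space-separated core list for nthreads into buf and return buf.
 * spread != 0 selects the one-thread-per-cluster layout, otherwise all
 * threads go to cluster 0. Output is cut short if buf is too small. */
static char *affinity_list(char *buf, size_t len, int nthreads, int spread)
{
    size_t used = 0;

    if (len > 0)
        buf[0] = '\0';
    for (int t = 0; t < nthreads && used < len; t++) {
        int core = spread ? spread_core(t) : packed_core(t, 0);
        int n = snprintf(buf + used, len - used, "%s%d", t ? " " : "", core);
        if (n < 0 || (size_t)n >= len - used)
            break; /* out of space: stop appending */
        used += (size_t)n;
    }
    return buf;
}
```

With 4 threads this gives "0 1 2 3" for the single-cluster case and "0 4 8 12" for the one-thread-per-cluster case. Either string can be passed to libgomp through the GOMP_CPU_AFFINITY environment variable before running the benchmark, provided the numbering assumption holds on the machine in question.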