http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50098
Bug #: 50098
Summary: The OpenMP ordered construct blocks parallelism, when appearing at the beginning of a loop body
Classification: Unclassified
Product: gcc
Version: 4.4.3
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: terec...@gmail.com

Created attachment 25021
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25021
program illustrating the performance issue of the ordered construct

According to the OpenMP spec, the ordered construct enforces sequential ordering of the ordered region. In the GNU OpenMP implementation this works fine if the ordered construct is at the end of the loop body. However, when it is at the beginning, parallelism is blocked.

Using the attached C program, I timed three versions of the code on a 4-core Intel CPU: sequential, ordered at the beginning of the loop body, and ordered at the end of the loop body. The version with the ordered construct at the beginning is slower than the sequential code (6.155s vs. 5.411s real time), whereas the version with the ordered construct at the end is faster (3.082s).

andrei@jos:~/src$ gcc ordered_perf_bug.c -o /tmp/a.out
andrei@jos:~/src$ time /tmp/a.out
real    0m5.411s
user    0m5.400s
sys     0m0.000s

andrei@jos:~/src$ gcc -fopenmp ordered_perf_bug.c -o /tmp/a.out
andrei@jos:~/src$ time /tmp/a.out
real    0m6.155s
user    0m24.530s
sys     0m0.010s

andrei@jos:~/src$ gcc -DFAST_ORDERED -fopenmp ordered_perf_bug.c -o /tmp/a.out
andrei@jos:~/src$ time /tmp/a.out
real    0m3.082s
user    0m12.290s
sys     0m0.000s

andrei@jos:~/src$ gcc --version
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
andrei@jos:~/src$ uname -a
Linux jos 2.6.32-27-generic #49-Ubuntu SMP Thu Dec 2 00:51:09 UTC 2010 x86_64 GNU/Linux

andrei@jos:~/src$ lshw
jos
    description: Computer
    width: 64 bits
    capabilities: vsyscall64 vsyscall32
  *-core
       description: Motherboard
       physical id: 0
     *-memory
          description: System memory
          physical id: 0
          size: 3922MiB
     *-cpu
          product: Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
          vendor: Intel Corp.
          physical id: 1
          bus info: cpu@0
          size: 1197MHz
          capacity: 1197MHz
          width: 64 bits
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp x86-64 constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi flexpriority ept vpid cpufreq
...

P.S. Looking at the GOMP_ordered_end(void) implementation, I suspect that it needs some synchronization code to fix the reported performance issue.
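
For readers without the attachment, here is a minimal sketch of the two variants being compared. It is not the attached ordered_perf_bug.c: the work() function, the run_loop() helper, the schedule(static, 1) clause and the reduction are all hypothetical choices made for illustration. Only the FAST_ORDERED macro name comes from the compile commands above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-iteration computation
 * (not taken from the attachment). */
static double work(int i)
{
    double x = 0.0;
    for (int k = 1; k <= 1000; k++)
        x += (double)(i % 7 + 1) / k;
    return x;
}

/* Runs n iterations and records in order[] the sequence in which
 * iterations passed through the ordered region. Returns the sum of
 * all work() results. The schedule clause is an assumption. */
double run_loop(int *order, int n)
{
    assert(order != NULL && n >= 0);

    double total = 0.0;
    int pos = 0;   /* shared; only modified inside the ordered region */

    #pragma omp parallel for ordered schedule(static, 1) reduction(+:total)
    for (int i = 0; i < n; i++) {
#ifdef FAST_ORDERED
        /* Variant reported as fast: parallel work first, ordered at the end. */
        total += work(i);
        #pragma omp ordered
        {
            order[pos++] = i;
        }
#else
        /* Variant reported as slow: ordered at the beginning, work after it. */
        #pragma omp ordered
        {
            order[pos++] = i;
        }
        total += work(i);
#endif
    }
    return total;
}
```

Calling run_loop() from a main() and building with gcc -fopenmp, with or without -DFAST_ORDERED, gives the two placements. In both of them the ordered region must run in iteration order, so order[] should end up as 0, 1, ..., n-1; only the position of the region relative to the parallel work differs.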