https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68718
--- Comment #3 from vries at gcc dot gnu.org ---
Created attachment 36940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36940&action=edit
more minimal version, also fails without fipa-pta

With this further minimized version (the sorting algorithm is stripped out
altogether), we can reproduce the same failure without -fipa-pta.

However, the OpenMP standard says:
...
If multiple threads write without synchronization to the same memory unit,
including cases due to atomicity considerations as described above, then a
data race occurs. Similarly, if at least one thread reads from a memory unit
and at least one thread writes without synchronization to that same memory
unit, including cases due to atomicity considerations as described above,
then a data race occurs. If a data race occurs then the result of the
program is unspecified.
...

So, AFAIU, the read in the testcase is missing the lock.

The test passes with LOCK_READ set to 1, without -fipa-pta. However, with
-fipa-pta, the test still fails.

It seems that fipa-pta (with compute_dependence_clique commented out in
compute_may_aliases, to rule out any influence of restrict) concludes that
the read from busy and the lock functions do not alias:
...
;; basic block 5, loop depth 1, count 0, freq 110, maybe hot
;;  prev block 4, next block 6, flags: (NEW, REACHABLE)
;;  pred:       4 [100.0%]  (FALLTHRU)
;;              7 [1.0%]  (FALSE_VALUE,EXECUTABLE)
  # .MEM_4 = PHI <.MEM_25(4), .MEM_5(7)>
  # VUSE <.MEM_4>
  # PT = { D.2027 } (escaped)
  _20 = .omp_data_i_14(D)->lockD.2068;
  # .MEM_22 = VDEF <.MEM_4>
  # USE = nonlocal { D.2027 } (escaped)
  # CLB = nonlocal { D.2027 } (escaped)
  omp_set_lockD.1909 (_20);
  # VUSE <.MEM_22>
  _23 = .omp_data_i_14(D)->busyD.2066;
  # VUSE <.MEM_22>
  # PT = { D.2027 } (escaped)
  _24 = .omp_data_i_14(D)->lockD.2068;
  # .MEM_26 = VDEF <.MEM_22>
  # USE = nonlocal { D.2027 } (escaped)
  # CLB = nonlocal { D.2027 } (escaped)
  omp_unset_lockD.1911 (_24);
  if (_23 == 0)
    goto <bb 9>;
  else
    goto <bb 6>;
;;  succ:       9 [9.0%]  (TRUE_VALUE,EXECUTABLE)
;;              6 [91.0%]  (FALSE_VALUE,EXECUTABLE)
...

So at optimized we end up with the load hoisted out of the loop:
...
;; basic block 4, loop depth 1, count 0, freq 110, maybe hot
;;  prev block 3, next block 5, flags: (NEW, REACHABLE)
;;  pred:       3 [100.0%]  (FALLTHRU,EXECUTABLE)
;;              6 [1.0%]  (FALSE_VALUE,EXECUTABLE)
  # .MEM_4 = PHI <.MEM_8(D)(3), .MEM_29(6)>
  # .MEM_22 = VDEF <.MEM_4>
  # USE = nonlocal { D.2027 } (escaped)
  # CLB = nonlocal { D.2027 } (escaped)
  omp_set_lockD.1909 (pretmp_38);
  # .MEM_26 = VDEF <.MEM_22>
  # USE = nonlocal { D.2027 } (escaped)
  # CLB = nonlocal { D.2027 } (escaped)
  omp_unset_lockD.1911 (pretmp_38);
  if (pretmp_37 == 0)
    goto <bb 7>;
  else
    goto <bb 5>;
;;  succ:       7 [9.0%]  (TRUE_VALUE,EXECUTABLE)
;;              5 [91.0%]  (FALSE_VALUE,EXECUTABLE)
...

I wonder how we should model things to fix this:
- mark shared variables (such as busy) as volatile inside the thread function
- interpret omp_set/unset_lock as a memory barrier (and, by extension, all
  functions for which we don't have the function body, since they might call
  such a function)