https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68718

--- Comment #3 from vries at gcc dot gnu.org ---
Created attachment 36940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36940&action=edit
more minimal version, also fails without fipa-pta

With this further minimized version (which strips the sorting algorithm
altogether), we can reproduce the same failure without -fipa-pta.

However, in the OpenMP standard I read:
...
If multiple threads write without synchronization to the same memory unit,
including cases due to atomicity considerations as described above, then a data
race occurs. Similarly, if at least one thread reads from a memory unit and at
least one thread writes without synchronization to that same memory unit,
including cases due to atomicity considerations as described above, then a data
race occurs. If a data race occurs then the result of the program is
unspecified.
...

So, AFAIU, the read of busy in the testcase is missing the lock. The test
passes with LOCK_READ set to 1, provided -fipa-pta is not used.

However, with -fipa-pta, the test still fails. It seems that fipa-pta
concludes that the read from busy and the lock functions do not alias
(verified with compute_dependence_clique commented out in compute_may_aliases,
to eliminate any possible influence of restrict):
...
;;   basic block 5, loop depth 1, count 0, freq 110, maybe hot
;;    prev block 4, next block 6, flags: (NEW, REACHABLE)
;;    pred:       4 [100.0%]  (FALLTHRU)
;;                7 [1.0%]  (FALSE_VALUE,EXECUTABLE)
  # .MEM_4 = PHI <.MEM_25(4), .MEM_5(7)>
  # VUSE <.MEM_4>
  # PT = { D.2027 } (escaped)
  _20 = .omp_data_i_14(D)->lockD.2068;
  # .MEM_22 = VDEF <.MEM_4>
  # USE = nonlocal { D.2027 } (escaped)
  # CLB = nonlocal { D.2027 } (escaped)
  omp_set_lockD.1909 (_20);
  # VUSE <.MEM_22>
  _23 = .omp_data_i_14(D)->busyD.2066;
  # VUSE <.MEM_22>
  # PT = { D.2027 } (escaped)
  _24 = .omp_data_i_14(D)->lockD.2068;
  # .MEM_26 = VDEF <.MEM_22>
  # USE = nonlocal { D.2027 } (escaped)
  # CLB = nonlocal { D.2027 } (escaped)
  omp_unset_lockD.1911 (_24);
  if (_23 == 0)
    goto <bb 9>;
  else
    goto <bb 6>;
;;    succ:       9 [9.0%]  (TRUE_VALUE,EXECUTABLE)
;;                6 [91.0%]  (FALSE_VALUE,EXECUTABLE)
...

So in the optimized dump we end up with the load hoisted out of the loop:
...
;;   basic block 4, loop depth 1, count 0, freq 110, maybe hot
;;    prev block 3, next block 5, flags: (NEW, REACHABLE)
;;    pred:       3 [100.0%]  (FALLTHRU,EXECUTABLE)
;;                6 [1.0%]  (FALSE_VALUE,EXECUTABLE)
  # .MEM_4 = PHI <.MEM_8(D)(3), .MEM_29(6)>
  # .MEM_22 = VDEF <.MEM_4>
  # USE = nonlocal { D.2027 } (escaped)
  # CLB = nonlocal { D.2027 } (escaped)
  omp_set_lockD.1909 (pretmp_38);
  # .MEM_26 = VDEF <.MEM_22>
  # USE = nonlocal { D.2027 } (escaped)
  # CLB = nonlocal { D.2027 } (escaped)
  omp_unset_lockD.1911 (pretmp_38);
  if (pretmp_37 == 0)
    goto <bb 7>;
  else
    goto <bb 5>;
;;    succ:       7 [9.0%]  (TRUE_VALUE,EXECUTABLE)
;;                5 [91.0%]  (FALSE_VALUE,EXECUTABLE)
...

I wonder how we should model things to fix this:
- mark shared variables (such as busy) as volatile inside the thread function
- interpret omp_set/unset_lock as a memory barrier (and, by extension, also
  all functions for which we don't have the function body, since they might
  call such a function)
