https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90037

--- Comment #7 from Jeffrey A. Law <law at redhat dot com> ---
Making some progress here.  Working with a reduced testcase we have the
following key blocks as we enter DOM:


;;   basic block 3, loop depth 0
;;    pred:       2
  __builtin_strdup (spec_22(D));
  _56 = separator_21(D) - spec_22(D);
  ulen_57 = (size_t) _56;
  _58 = ulen_57 + 1;
  u_60 = __builtin_malloc (_58);
  goto <bb 14>; [100.00%]
;;    succ:       14

;;   basic block 4, loop depth 0
;;    pred:       15
  iftmp.0_27 = separator_21(D) + 1;
  goto <bb 14>; [100.00%]
;;    succ:       14

;;   basic block 5, loop depth 0
;;    pred:       14
  # error_msg_40 = PHI <0B(14)>
  # iftmp.0_50 = PHI <iftmp.0_54(14)>
  endpwent ();
  _51 = iftmp.0_50 != 0B;
  _52 = error_msg_40 == 0B;
  _53 = _51 & _52;
  if (_53 != 0)
    goto <bb 7>; [49.98%]
  else
    goto <bb 10>; [50.02%]
;;    succ:       7
;;                10

;;   basic block 6, loop depth 0
;;    pred:       13
  # error_msg_11 = PHI <"invalid spec"(13)>
  # iftmp.0_67 = PHI <0B(13)>
  endpwent ();
  _7 = iftmp.0_67 != 0B;
  _8 = error_msg_11 == 0B;
  _9 = _7 & _8;
  if (_9 != 0)
    goto <bb 7>; [49.98%]
  else
    goto <bb 9>; [50.02%]
;;    succ:       7
;;                9

;;   basic block 7, loop depth 0
;;    pred:       6
;;                12
;;                5
  # iftmp.0_68 = PHI <iftmp.0_67(6), iftmp.0_69(12), iftmp.0_50(5)>
  _10 = *iftmp.0_68;
  if (_10 != 43)
    goto <bb 8>; [48.88%]
  else
    goto <bb 10>; [51.12%]
;;    succ:       8
;;                10

[ ... ]

;;   basic block 12, loop depth 0
;;    pred:       14
;;                13
  # iftmp.0_69 = PHI <iftmp.0_54(14), 0B(13)>
  _44 = iftmp.0_69 != 0B;
  _45 = 1;
  _46 = _44 & _45;
  if (_46 != 0)
    goto <bb 7>; [100.00%]
  else
    goto <bb 10>; [0.00%]
;;    succ:       7
;;                10


[ ... ]

;;   basic block 14, loop depth 0
;;    pred:       3
;;                4
  # iftmp.0_54 = PHI <0B(3), iftmp.0_27(4)>
  # u_74 = PHI <u_60(3), u_66(4)>
  if (u_74 != 0B)
    goto <bb 5>; [76.89%]
  else
    goto <bb 12>; [23.11%]
;;    succ:       5
;;                12

We find some jump threading opportunities.  What's most interesting to note is
the thread of control bb3->bb14->bb5->bb7.

This is an infeasible path -- it can't happen at runtime.  But we also can't
simplify it in DOM.  We need jump threading & cleanup_cfg to simplify the CFG. 
After DOM's jump threading pass is complete, the CFG is cleaned up and we've
incrementally updated the SSA form we'll have:

;;   basic block 3, loop depth 0
;;    pred:       2
  __builtin_strdup (spec_22(D));
  _41 = (long int) spec_22(D);
  _56 = -_41;
  ulen_57 = (size_t) _56;
  _58 = ulen_57 + 1;
  u_60 = __builtin_malloc (_58);
  if (u_60 != 0B)
    goto <bb 5>; [100.00%]
  else
    goto <bb 16>; [0.00%]
;;    succ:       5
;;                16

[ First note that the test originally from bb14 has been moved up into bb3 ]

;;   basic block 5, loop depth 0
;;    pred:       3
  # error_msg_40 = PHI <0B(3)>
  # iftmp.0_50 = PHI <0B(3)>
  endpwent ();
  _51 = 0;
  _52 = 1;
  _53 = _51;
  if (_51 != 0)
    goto <bb 7>; [0.00%]
  else
    goto <bb 9>; [100.00%]
;;    succ:       7
;;                9

The combination of threading and cfgcleanup has exposed the constants.   But
DOM is complete at this point and thus we don't get another change to optimize
the block.

And bb7 will look like:

;;   basic block 7, loop depth 0
;;    pred:       15
;;                11
;;                5
  # iftmp.0_68 = PHI <iftmp.0_27(15), iftmp.0_69(11), 0B(5)>
  _10 = *iftmp.0_68;
  if (_10 != 43)
    goto <bb 8>; [48.88%]
  else
    goto <bb 9>; [51.12%]
;;    succ:       8
;;                9

Notice how the value 0 flows in from bb5 for iftmp.0_68 which we then
dereference.  It's still an infeasible path, but without another DOM pass,
predicate analysis or something similar, we're going to issue the false
positive.

So we already know that iterated DOM/jump threading is too expensive.  We could
move the path isolation bits later -- like after dom3 which would fix this
regression, but probably introduce others.

Ultimately it's the usual phase ordering problem.

As I've noted in other BZs, I've never come up with an algorithm that would
allow us to do some kind of worklist approach to replicate iterated
dom+threading.


But I have speculated that we could get most of the benefit by identifying a
subset of blocks and running a local (block only) simplifier.  On my whiteboard
one of the filters to identify the subset of blocks would be blocks with
degenerate PHIs.  bb5 in this case would qualify and we'd see that bb5's
conditional is always false and thus the edge bb5->bb7 can be deleted which
would in turn eliminate the false positive.

One of the questions that came up for me was whether or not phi-only cprop
would help here.  It doesn't.  It only looks at degenerate PHIs to build its
worklist of SSA_NAMEs to try and propagate away.  Look at bb5.  While there are
degenerate PHIs, they do not feed anything useful.  We'd actually have to run a
more thorough constant propagation.  Something to ponder.

Anyway, wanted to get my thoughts recorded so I don't forget anything.

Reply via email to