https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81165
Alexandre Oliva <aoliva at gcc dot gnu.org> changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|NEW                            |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org  |aoliva at gcc dot gnu.org

--- Comment #13 from Alexandre Oliva <aoliva at gcc dot gnu.org> ---
Created attachment 42800
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42800&action=edit
patch, first try

This is my first cut at it.  I couldn't quite figure out how to determine
efficiently whether stmts were dead with a forward pass (I'd have to keep
track of all potentially dead stmts, probably without knowing for sure what's
dead and what isn't till the very end), so I introduced it as a backward scan
of all copied blocks in the path, which we only perform when we reach the stmt
count limit, and which can tell right away whether each stmt is dead, so it
also knows when to stop because we're past the limit.

This indeed solves the problem reported here, but in a kind of weird way.  We
perform very different threading in earlier passes, and we end up with all the
computation in the loop entry test all the way to dse3, past even thread4.
It's cddce3 that optimizes it out, and later we further simplify the
conditional set of t1, and its use, into a single (x1&1)!=0 (testl;je), even
better than the code we generated at 7.x.  However, the code at e.g. dom2
looks much uglier than what we get without this patch.

One thing I haven't done, but I kind of feel might be useful, is to cache the
killed_stmts computation per block within a single pass.  I tried to cache it
in the path itself, but (unsurprisingly) we then never got any hits :-)  Do
you believe this would be worthwhile?  I gather it would have to touch a
number of passes that call into the jump threading infrastructure so that
they reset the cache.

Another thought that occurred to me was that, instead of scanning insns
backwards, I could start at the final control-flow stmt and just follow the
links from uses to defs.  It seems like this could avoid the linear scan, but
it would require a bit more complexity and additional memory beyond the
hash_map I'm using; I can see it working efficiently with one hash_map mapping
SSA_NAMEs to their remaining uses within the block, and one vector or bitmap
to hold dead defs pending back-propagation to their uses.  It feels like this
could be more efficient, but I'm hesitant because of the far more random
access patterns it would entail.  I guess I'll have to implement it and
benchmark it to know for sure...
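
To make the backward scan described above concrete, here is a minimal
standalone sketch of the idea, using a toy statement representation instead
of GCC's gimple/SSA_NAME data structures; Stmt and count_live_stmts are
made-up names for illustration, not code from the attached patch:

  #include <string>
  #include <unordered_set>
  #include <vector>

  struct Stmt
  {
    std::string def;                 // SSA name defined (empty if none)
    std::vector<std::string> uses;   // SSA names used
    bool side_effects = false;       // stores, calls etc. are never dead
  };

  /* Walk the copied statements backwards.  LIVE_SEED holds the names used
     by the final control-flow stmt of the path; a statement is live iff it
     has side effects or defines a name some later live statement uses.
     Count only live statements against LIMIT, giving up as soon as it is
     exceeded.  */
  static int
  count_live_stmts (const std::vector<Stmt> &stmts,
                    const std::vector<std::string> &live_seed, int limit)
  {
    std::unordered_set<std::string> live (live_seed.begin (),
                                          live_seed.end ());
    int count = 0;
    for (auto it = stmts.rbegin (); it != stmts.rend (); ++it)
      {
        bool needed = it->side_effects
                      || (!it->def.empty () && live.count (it->def));
        if (!needed)
          continue;           /* Dead: copying it costs nothing.  */
        if (++count > limit)
          break;              /* Past the limit; no point scanning on.  */
        for (const std::string &u : it->uses)
          live.insert (u);    /* Its operands become live in turn.  */
      }
    return count;
  }

The real code of course works on gimple statements and SSA_NAMEs; the toy
only keeps enough structure to show the early bail-out once the live count
passes the limit.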
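
And a similarly hypothetical sketch of the use->def alternative from the last
paragraph: a hash map from each SSA name to its remaining use count within
the copied blocks, plus a worklist of defs whose last use just went away.
Again, all names here are invented for illustration:

  #include <string>
  #include <unordered_map>
  #include <vector>

  struct Stmt
  {
    std::string def;                 // SSA name defined (empty if none)
    std::vector<std::string> uses;   // SSA names used
    bool side_effects = false;
  };

  /* Count how many statements become dead once the uses of the final
     conditional (COND_USES) go away.  REMAINING maps each name to the
     number of its uses within the copied blocks; DEF_STMT maps each name
     to its defining statement.  Deadness is propagated through use->def
     links via a worklist instead of a linear backward scan.  */
  static int
  count_dead_stmts (const std::unordered_map<std::string,
                                             const Stmt *> &def_stmt,
                    std::unordered_map<std::string, int> remaining,
                    const std::vector<std::string> &cond_uses)
  {
    std::vector<std::string> worklist;  /* Dead defs pending propagation.  */
    int dead = 0;

    auto drop_use = [&] (const std::string &name)
      {
        auto it = remaining.find (name);
        if (it != remaining.end () && --it->second == 0)
          worklist.push_back (name);    /* Last use gone: the def died.  */
      };

    for (const std::string &u : cond_uses)
      drop_use (u);

    while (!worklist.empty ())
      {
        std::string name = worklist.back ();
        worklist.pop_back ();
        auto it = def_stmt.find (name);
        if (it == def_stmt.end () || it->second->side_effects)
          continue;           /* Defined elsewhere, or must be kept.  */
        ++dead;
        for (const std::string &u : it->second->uses)
          drop_use (u);       /* Its operands each lose one use.  */
      }
    return dead;
  }

This avoids touching statements that stay live, at the cost of the extra
per-name bookkeeping and the more random access patterns mentioned above.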