https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63677
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-10-29
                 CC|                            |jakub at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The problem is that the loop is first vectorized, then several passes later SLP
vectorizes the initialization, so after some cleanups we have e.g. in cddce2:
  MEM[(int *)&a] = { 0, 1, 2, 3 };
  MEM[(int *)&a + 16B] = { 4, 5, 6, 7 };
  vect__13.6_20 = MEM[(int *)&a];
  vect__13.6_17 = MEM[(int *)&a + 16B];
But there is no further FRE pass that would optimize the loads into
  vect__13.6_20 = { 0, 1, 2, 3 };
  vect__13.6_17 = { 4, 5, 6, 7 };
(supposedly that would need to be done before forwprop4, which could in theory
refold all the stmts into a constant).
Richard, how expensive would it be to schedule another FRE pass if anything has
been vectorized in the current function (either by the vect pass or by slp)?
Or are there other passes that handle this?  Looking at e.g.

typedef int V __attribute__((vector_size (4 * sizeof (int))));
struct S { int a[4]; };

V __attribute__ ((noinline))
foo (struct S *p)
{
  *(V *) p = (V) { 1, 2, 3, 4 };
  return *(V *) p;
}

with -O2 -fno-tree-fre, it seems DOM is able to do that, but unfortunately at
dom2 time the values have not been sufficiently forward propagated for dom2 to
optimize this.
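
For anyone who wants to poke at this locally, here is a minimal sketch that places the foo testcase from the comment next to a hand-written version of what store-to-load forwarding would ideally produce. The names foo_forwarded and foo_variants_agree are hypothetical helpers, not part of the bug report, and the aligned attribute on the locals is an assumption added so that the vector access through the struct pointer is suitably aligned. It only checks that the two forms behave the same; whether a given pass actually performs the transformation has to be read from the dumps (e.g. by compiling with -fdump-tree-all).

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang vector extension: four ints in one vector.  */
typedef int V __attribute__((vector_size (4 * sizeof (int))));
struct S { int a[4]; };

/* Same shape as the testcase in the comment: a vector store
   followed immediately by a vector load from the same address.  */
V __attribute__ ((noinline))
foo (struct S *p)
{
  *(V *) p = (V) { 1, 2, 3, 4 };
  return *(V *) p;
}

/* Hypothetical hand-forwarded form: the store is kept, and the
   load is replaced by the constant that was just stored.  */
V __attribute__ ((noinline))
foo_forwarded (struct S *p)
{
  V c = (V) { 1, 2, 3, 4 };
  *(V *) p = c;
  return c;
}

/* Hypothetical helper: returns 1 if both variants return the same
   vector and leave the same bytes in memory.  */
int
foo_variants_agree (void)
{
  static const int expected[4] = { 1, 2, 3, 4 };
  /* Assumption: align the locals so the V access is well-aligned.  */
  struct S s1 __attribute__ ((aligned (__alignof__ (V))));
  struct S s2 __attribute__ ((aligned (__alignof__ (V))));
  V r1, r2;

  assert (sizeof (V) == sizeof (struct S));

  r1 = foo (&s1);
  r2 = foo_forwarded (&s2);

  return memcmp (&r1, &r2, sizeof (V)) == 0
         && memcmp (&r1, expected, sizeof expected) == 0
         && memcmp (&s1, expected, sizeof expected) == 0
         && memcmp (&s2, expected, sizeof expected) == 0;
}
```

Calling foo_variants_agree () from a small main and comparing the tree dumps of foo and foo_forwarded, with and without -fno-tree-fre, should show which pass (if any) turns the former into the latter.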