https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117498
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This looks like a lim4 bug to me (looking at #c2 with -O3).
Before lim4 we have:

  <bb 2> [local count: 14598063]:
  c.0_16 ={v} c;
  if (c.0_16 == 0B)
    goto <bb 4>; [18.09%]
  else
    goto <bb 3>; [81.91%]

  <bb 3> [local count: 11957273]:

  <bb 4> [local count: 14598063]:
  # _17 = PHI <1(3), -1(2)>
  # prephitmp_26 = PHI <1(3), 255(2)>
  f = _17;
  n = 0;
  d.2_25 = d;
  if (d.2_25 <= 3)
    goto <bb 10>; [89.00%]
  else
    goto <bb 9>; [11.00%]

  <bb 10> [local count: 12992276]:
  ivtmp.140_38 = (unsigned int) d.2_25;
  goto <bb 7>; [100.00%]
...
  <bb 7> [local count: 118111600]:
  # ivtmp.140_45 = PHI <ivtmp.140_73(12), ivtmp.140_38(10)>
  if (prephitmp_26 != 1)
    goto <bb 29>; [89.00%]
  else
    goto <bb 6>; [11.00%]

Only if prephitmp_26 is not 1 (it must be 255 then) do we branch to bb 29,
which has a vectorized unrolled loop doing 251 iterations and (correctly)
invoking UB in that case.  But prephitmp_26 is actually 1 at runtime, so
there is no UB.

lim4, however, decides to hoist one of the vector loads from bb 29 to bb 10,
and that is not correct: the load would precede UB only when
prephitmp_26 != 1, and now it is executed unconditionally:

   <bb 10> [local count: 12992276]:
   ivtmp.140_38 = (unsigned int) d.2_25;
+  vect__12.117_6 = MEM <vector(16) char> [(char *)&n];
+  g__lsm.141_333 = _30(D);
+  g__lsm_flag.142_315 = 0;
   goto <bb 7>; [100.00%]

   <bb 29> [local count: 105119324]:
-  vect__12.117_6 = MEM <vector(16) char> [(char *)&n];
   vect__12.118_21 = MEM <vector(16) char> [(char *)&n + 16B];
   vect__12.119_107 = MEM <vector(16) char> [(char *)&n + 32B];
   vect__12.120_109 = MEM <vector(16) char> [(char *)&n + 48B];

The crash is on exactly that load:

=> 0x000000000040105d <+61>:    movdqa 0x7(%rsp),%xmm4

because %rsp+7 is obviously not 16-byte aligned; n has just char type, so
1-byte alignment is perfectly fine for it.
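To make the effect of the hoist concrete, here is a minimal C sketch of the reasoning above. It is only a model, not the testcase from the PR and not what the compiler emits: the helper names (is_aligned_16, faults_guarded, faults_hoisted) are hypothetical, and instead of performing a misaligned vector load (which would itself be UB in C) each function just counts whether an aligned 16-byte load, like the movdqa above, would have been issued on a misaligned address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p would be a valid memory operand for an
   aligned 16-byte load such as movdqa. */
static int is_aligned_16(const void *p)
{
    assert(p != 0);
    return ((uintptr_t)p & 15u) == 0;
}

/* Model of the IL before the hoist: the aligned load is only reached on the
   bb 7 -> bb 29 edge, i.e. when the guard is not 1.  Returns how many
   aligned loads would be issued on a misaligned address. */
static int faults_guarded(const char *n, unsigned char guard)
{
    int faults = 0;
    if (guard != 1)                 /* path that invokes UB anyway */
        faults += !is_aligned_16(n);
    return faults;
}

/* Model of the IL after the hoist: the same load now sits in bb 10 and is
   executed regardless of the guard. */
static int faults_hoisted(const char *n, unsigned char guard)
{
    int faults = !is_aligned_16(n); /* unconditional load */
    (void)guard;
    return faults;
}
```

With a char object placed 7 bytes past a 16-byte boundary (mirroring 0x7(%rsp)) and a guard value of 1, the guarded model reports no faulting load while the hoisted model reports one, which matches the observed crash.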