Hi!

On the following testcase with -m64 -O3 -mavx2 (it is just an example; you can replace the loop there with any code that doesn't touch the stack or frame pointer at all), only f3 is shrink-wrapped, and in that case, on the other hand, it doesn't add vzeroupper before leaving the AVX-using code, which IMNSHO it should. But I wonder why we can't also shrink-wrap the first two functions, f1 and f2 (well, for f2 it wouldn't be textbook shrink-wrapping, but essentially throwing away the prologue/epilogue).
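To make the f3 shape concrete, here is a rough source-level analogue, only a sketch and not what the RTL pass actually does. The names f3_like, cold_path and cold_calls are made up for illustration, bar and baz are local stand-ins for the external functions, and plain int arrays replace the AVX2 gather so it builds without -mavx2. The idea is that everything needing a frame sits on one path, and the loop path never enters it. What the compiler really emits depends on version and flags.

```c
enum { N = 16 };
int a[N], b[N];
int cold_calls;		/* Counts how many stand-in calls were made.  */

/* Local stand-ins for the external bar () and baz () of the testcase.  */
static void bar (void) { cold_calls++; }
static void baz (void) { cold_calls++; }

/* Hand-outlined slow path: the only code that makes calls, so the only
   code that would need a prologue/epilogue.  noinline is a GCC/Clang
   attribute, used here just to keep the split visible.  */
__attribute__ ((noinline)) static void
cold_path (void)
{
  bar ();
  baz ();
}

int
f3_like (int c)
{
  int i;
  if (c)
    /* Fast path: touches neither the stack nor the frame pointer.  */
    for (i = 0; i < N; i++)
      a[i] = b[i] + 1;
  else
    cold_path ();
  return c;
}
```

Calling f3_like with a nonzero argument runs only the loop; calling it with 0 goes through cold_path.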

From a quick look, f1 probably isn't shrink-wrapped because the set of bbs that need the prologue/epilogue around them doesn't end in a return, but in a tail call. Can't we just add a prologue before the bar call and throw the epilogue away? (Normally the epilogue in a function that ends only in a tail call is just emitted after the barrier and optimized away, I think; couldn't we do the same?)

And f2 is something that IMHO happens very often, especially with AVX/AVX2 code: the prologue is expensive because it realigns the stack. The reason for that is that until reload we don't know whether something will be spilled to the stack, and we need/want 32-byte aligned stack slots for that spilling. Isn't the case where none of the bbs actually needs the stack/frame pointer just a special case of shrink-wrapping? Can't we throw the prologue/epilogue away then and just end the function in simple_return?

f4 is another testcase for the same thing, this time with no AVX/AVX2 intrinsics, but with a loop that the vectorizer vectorizes using 256-bit vectors.

#include <x86intrin.h>

__m256i a[16], b[16], f;
__m256d g[16], h;
extern void bar (void);
extern void baz (void);

void
f1 (int c)
{
  int i;
  if (c)
    for (i = 0; i < 16; i++)
      a[i] = _mm256_i64gather_epi64 (NULL, b[i], 1);
  else
    {
      bar ();
      baz ();
    }
}

void
f2 (void)
{
  int i;
  for (i = 0; i < 16; i++)
    a[i] = _mm256_i64gather_epi64 (NULL, b[i], 1);
}

int
f3 (int c)
{
  int i;
  if (c)
    for (i = 0; i < 16; i++)
      a[i] = _mm256_i64gather_epi64 (NULL, b[i], 1);
  else
    {
      bar ();
      baz ();
    }
  return c;
}

float x[8], y[8];

void
f4 (void)
{
  int i;
  for (i = 0; i < 8; i++)
    x[i] = y[i] * 2 - x[i];
}

	Jakub