https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65369
--- Comment #26 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, on my version of the testcase with r210843 -O3 -mcpu=power8 there are about 49 occurrences of
  32 bit load in host endianness found at: _105 = MEM[(const unsigned char *)load_src_25];
so I've added a quick hack (should have used dbg counters perhaps), and with BSWAPCNT=16 it works fine, with BSWAPCNT=17 it fails.

In the *.optimized dump, I've noticed that this single load matters for vectorization in the md4_update function: with BSWAPCNT=16 a chunk of code isn't vectorized, with BSWAPCNT=17 it is.

--- tree-ssa-math-opts.c.xx 2015-03-12 17:44:13.000000000 +0100
+++ tree-ssa-math-opts.c 2015-03-12 18:52:49.280605232 +0100
@@ -2132,6 +2132,17 @@ bswap_replace (gimple stmt, gimple_stmt_
 gimple addr_stmt, load_stmt;
 unsigned align;
+static int cntx = -1;
+if (cntx == -1)
+{
+if (getenv ("BSWAPCNT"))
+cntx = atoi (getenv ("BSWAPCNT"));
+else
+cntx = 0x7fffffff;
+}
+if (cntx == 0)
+return false;
+cntx--;
 align = get_object_alignment (src);
 if (bswap && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
   return false;

So this might very well just trigger a latent bug in the vectorizer or the powerpc backend.
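
For anyone who wants to reuse the counting trick outside of GCC, here is a minimal standalone sketch of the same idea: allow a transformation only for the first N calls, with N read once from an environment variable. The helper name `budget_allows` is hypothetical and not anything in the GCC tree; unlike the hack above, this version uses a separate "initialised" flag and treats negative values as 0, which is an assumption of the sketch.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper (not part of GCC): return true for the first N
   calls, where N is read once from the environment variable ENV_NAME.
   If the variable is unset, the budget is effectively unlimited.  */
bool
budget_allows (const char *env_name)
{
  static bool initialised = false;
  static int remaining = 0;

  if (!initialised)
    {
      const char *val = getenv (env_name);
      /* Unset variable: never veto.  Negative values are clamped to 0.  */
      remaining = val ? atoi (val) : INT_MAX;
      if (remaining < 0)
        remaining = 0;
      initialised = true;
    }

  if (remaining == 0)
    return false;   /* Budget exhausted: skip the transformation.  */

  remaining--;
  return true;
}
```

A transformation would call `budget_allows ("BSWAPCNT")` at its entry and bail out when it returns false. Rerunning with different values of the variable then narrows the failure down to a single occurrence, as with 16 vs. 17 above.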