https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80647
            Bug ID: 80647
           Summary: vectorized loop crashes from wrongly assuming 16 byte
                    alignment
           Product: gcc
           Version: 6.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yzhang1985 at gmail dot com
  Target Milestone: ---

Created attachment 41328
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41328&action=edit
compiling with -O3 will reproduce the crash

I'm getting a crash in a function that extracts a subregion of an image
in-place. I compile with gcc -O3, which vectorizes the innermost loop:

    while (twd--) {
        *pintdest++ = *pintsrc++;
    }

---------------assembly-------------------------
movdqa (%r10,%rax,1),%xmm0
add    $0x1,%ecx
movups %xmm0,(%rdx,%rax,1)
------------------------------------------------

It crashes on movdqa because the address isn't 16-byte aligned. It should be
using an unaligned vector load such as movdqu or lddqu instead. I tested it
with GCC 4.8, which vectorized the loop correctly.

Starting with Nehalem, there is no penalty for using unaligned loads/stores
if the vector doesn't span 2 cache lines, so why not always generate
unaligned loads/stores? The other advantage of aligned data used to be that
the vector load/store could be fused with another instruction as a memory
operand, reducing machine code size. But even that alignment restriction for
memory operands was relaxed starting with Sandy Bridge's VEX instructions.
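
The attachment is not reproduced in this message, so the following is only a
sketch under stated assumptions. If pintsrc and pintdest are int pointers
derived from a byte-addressed image buffer, they may not even be 4-byte
aligned. C treats dereferencing a misaligned int pointer as undefined
behavior, which can let the vectorizer assume natural alignment when it
peels iterations to reach a 16-byte boundary. One way to write the copy
without making any alignment promise is to go through memcpy. The helper
name copy_words_unaligned and its parameters are hypothetical and are not
taken from the report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the attachment): copy `count` 32-bit words
 * between byte buffers that may not be 4-byte aligned.
 *
 * Each word is moved through a local variable with memcpy, so the compiler
 * is never told that src or dest is aligned and has to emit loads/stores
 * that are safe for unaligned addresses.
 *
 * The copy runs front to back, so it also works in-place when dest lies
 * before src in the same buffer (the assumed sub-region extraction case). */
static void copy_words_unaligned(unsigned char *dest,
                                 const unsigned char *src,
                                 size_t count)
{
    assert(dest != NULL && src != NULL);

    while (count--) {
        uint32_t word;
        memcpy(&word, src, sizeof word);   /* unaligned-safe load  */
        memcpy(dest, &word, sizeof word);  /* unaligned-safe store */
        src += sizeof word;
        dest += sizeof word;
    }
}
```

With optimization enabled, compilers commonly reduce each fixed-size memcpy
to a plain move, so this form is not expected to cost more than the pointer
version, although that is worth confirming in the generated assembly.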