http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51254
           Bug #: 51254
          Summary: Missed Optimization: IVOPTS don't handle unaligned memory access.
   Classification: Unclassified
          Product: gcc
          Version: 4.7.0
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: tree-optimization
       AssignedTo: unassig...@gcc.gnu.org
       ReportedBy: duyue...@gmail.com

IVOPTS does not handle unaligned memory accesses because of
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17949.
Without this optimization, however, we may generate sub-optimal code.

Here is a case from EEMBC autcor00:

    fxpAutoCorrelation (
        e_s16 *InputData,
        e_s16 *AutoCorrData,
        e_s16 DataSize,
        e_s16 NumberOfLags,
        e_s16 Scale
    )
    {
        n_int i;
        n_int lag;
        n_int LastIndex;
        e_s32 Accumulator;

        for (lag = 0; lag < NumberOfLags; lag++)
        {
            Accumulator = 0;
            LastIndex = DataSize - lag;
            for (i = 0; i < LastIndex; i++)
            {
                Accumulator += ((e_s32) InputData[i] * (e_s32) InputData[i+lag]) >> Scale;
            }
            AutoCorrData[lag] = (e_s16) (Accumulator >> 16) ;
        }
    }

Compile it with an ARM cross-compiler.
Compile flags: -O3 -mfpu=neon -mfloat-abi=softfp

The key vectorized loop is:

    .L8:
        add         r7, ip, sl
        vldmia      ip, {d18-d19}
        vld1.16     {q8}, [r7]
        vmull.s16   q12, d18, d16
        vshl.s32    q12, q12, q11
        vmull.s16   q8, d19, d17
        add         r4, r4, #1
        vadd.i32    q10, q12, q10
        vshl.s32    q8, q8, q11
        cmp         r4, r8
        vadd.i32    q10, q8, q10
        add         ip, ip, #16
        bcc         .L8

There are three ADD instructions in this loop: two calculate addresses and one
updates the loop counter. Only the loop-counter ADD is actually needed; the other
two can be replaced by address post-increment operations.

The root cause is that IVOPTS does not handle unaligned memory accesses. If we
remove the corresponding checks in find_interesting_uses_address, the result is:

    .L8:
        vldmia      r6!, {d18-d19}
        vld1.16     {q8}, [r7]!
        vmull.s16   q12, d18, d16
        vshl.s32    q12, q12, q11
        vmull.s16   q8, d19, d17
        add         r4, r4, #1
        vadd.i32    q10, q12, q10
        vshl.s32    q8, q8, q11
        cmp         r4, sl
        vadd.i32    q10, q8, q10
        bcc         .L8

This should be the result we want. See
http://gcc.gnu.org/ml/gcc/2011-11/msg00311.html for more details.
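
For readers less familiar with what IVOPTS is being asked to do here, the following is a rough source-level sketch of the transformation on the scalar inner loop. It is only an illustration, not GCC's internal representation: the typedef widths and the helper names inner_indexed, inner_pointers and check_equivalence are assumptions made for this example and do not come from EEMBC or GCC. The first function recomputes both addresses from the index i, as in the report; the second gives each address its own induction variable that advances by one element per iteration, which is the shape a post-increment addressing mode can absorb. Whether a given compiler and target actually emit post-increment loads for the second form is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the EEMBC typedefs; the exact widths are an assumption. */
typedef int16_t e_s16;
typedef int32_t e_s32;
typedef int     n_int;

/* Indexed form, as in the report: both addresses are derived from i. */
static e_s32 inner_indexed(const e_s16 *InputData, n_int lag,
                           n_int LastIndex, e_s16 Scale)
{
    e_s32 Accumulator = 0;
    for (n_int i = 0; i < LastIndex; i++)
        Accumulator += ((e_s32) InputData[i] * (e_s32) InputData[i + lag]) >> Scale;
    return Accumulator;
}

/* Pointer form: each address is its own induction variable, stepped by
 * one element per iteration. Only the counter i still needs an add. */
static e_s32 inner_pointers(const e_s16 *InputData, n_int lag,
                            n_int LastIndex, e_s16 Scale)
{
    const e_s16 *p = InputData;        /* walks InputData[i]       */
    const e_s16 *q = InputData + lag;  /* walks InputData[i + lag] */
    e_s32 Accumulator = 0;
    for (n_int i = 0; i < LastIndex; i++) {
        e_s32 a = *p++;                /* load, then post-increment */
        e_s32 b = *q++;
        Accumulator += (a * b) >> Scale;
    }
    return Accumulator;
}

/* Hypothetical self-check: both forms agree on a small sample input. */
static void check_equivalence(void)
{
    const e_s16 data[8] = {3, -1, 4, 1, -5, 9, 2, -6};
    for (n_int lag = 0; lag < 4; lag++) {
        n_int LastIndex = 8 - lag;
        assert(inner_indexed(data, lag, LastIndex, 1) ==
               inner_pointers(data, lag, LastIndex, 1));
    }
}
```

Calling check_equivalence() from a test driver confirms that the two forms compute the same accumulator, so the change affects only how addresses are formed, not the result.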