http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252
--- Comment #2 from Stupachenko Evgeny <evstupac at gmail dot com> 2012-02-29 12:32:20 UTC ---
The difference between the two dumps, generated with

ARM: gcc -O3 -mfpu=neon test.c -S -ftree-vectorizer-verbose=12
x86: gcc -O3 -m32 -msse3 test.c -S -ftree-vectorizer-verbose=12

starts at:

For ARM (can use vec_load_lanes):

6: === vect_make_slp_decision ===
6: === vect_detect_hybrid_slp ===
6: === vect_analyze_loop_operations ===
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)>
……
6: can use vec_load_lanes<CI><V16QI>
6: vect_model_load_cost: unaligned supported by hardware.
6: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .

For x86 (no array mode for V16QI[3]):

6: === vect_make_slp_decision ===
6: === vect_detect_hybrid_slp ===
6: === vect_analyze_loop_operations ===
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)>
.……
6: no array mode for V16QI[3]
6: the size of the group of strided accesses is not a power of 2
6: not vectorized: relevant stmt not supported: r_8 = *in_35;

As I mentioned before, x86 is able to handle this case as well (ARM can shuffle as part of the loads; x86 can use pshufb).
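
For readers without the attachment, here is a minimal sketch of the kind of loop that produces a group of three strided byte loads, which is the pattern the "not a power of 2" message refers to. This is not the test.c from the report: the function name `rgb_average`, its parameters, and the averaging arithmetic are assumptions made only for illustration. Only the access pattern (three consecutive byte loads through a pointer that advances by 3 each iteration) is taken from the dump above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel (not the reporter's test.c): reads interleaved
   R,G,B bytes and writes one byte per pixel.  Each iteration loads
   in[0], in[1], in[2] and then advances in by 3, so the loads form a
   strided access group of size 3. */
void rgb_average(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t r = in[0];
        uint8_t g = in[1];
        uint8_t b = in[2];
        /* The sum is computed in int, so it cannot overflow a byte. */
        out[i] = (uint8_t)((r + g + b) / 3);
        in += 3;
    }
}
```

Compiling such a file with the two command lines quoted in the comment is one way to compare the vectorizer's decisions on each target; whether this particular sketch reproduces the exact messages above has not been checked here.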