http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #2 from Stupachenko Evgeny <evstupac at gmail dot com> 2012-02-29 12:32:20 UTC ---
The two vectorizer dumps produced by

Arm: gcc -O3 -mfpu=neon test.c -S -ftree-vectorizer-verbose=12
X86: gcc -O3 -m32 -msse3 test.c -S -ftree-vectorizer-verbose=12

start to differ at:

For Arm (can use vec_load_lanes):

6: === vect_make_slp_decision === 
6: === vect_detect_hybrid_slp ===
6: === vect_analyze_loop_operations ===
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)>

……

6: can use vec_load_lanes<CI><V16QI> 
6: vect_model_load_cost: unaligned supported by hardware. 
6: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .

For x86 (no array mode for V16QI[3]):

6: === vect_make_slp_decision === 
6: === vect_detect_hybrid_slp === 
6: === vect_analyze_loop_operations === 
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)> 

……

6: no array mode for V16QI[3] 
6: the size of the group of strided accesses is not a power of 2 
6: not vectorized: relevant stmt not supported: r_8 = *in_35; 

As I mentioned before, x86 is able to handle this too (Arm can shuffle as
part of the load, x86 can use pshufb).
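
To illustrate the pshufb idea, here is a minimal scalar sketch in C. It is not the vectorizer's implementation and not real SSE code. It assumes the strided group is 3 interleaved bytes (as V16QI[3] in the dump suggests), and the names shuffle_bytes and deinterleave3 are hypothetical. shuffle_bytes models the semantics of one pshufb (an SSSE3 instruction, available as the _mm_shuffle_epi8 intrinsic), and deinterleave3 extracts one of the three interleaved streams from 48 bytes using three such shuffles OR-ed together.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one pshufb: a mask byte with its top bit set yields 0,
   otherwise its low 4 bits select a byte of the 16-byte source. */
void shuffle_bytes(const uint8_t src[16], const uint8_t mask[16],
                   uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Hypothetical helper: pull stream `channel` (0..2) out of 16 interleaved
   3-byte groups (48 bytes = three 16-byte vectors). Each input vector gets
   one shuffle; lanes that live in another vector are zeroed by the mask,
   so the three partial results can simply be OR-ed. */
void deinterleave3(const uint8_t in[48], int channel, uint8_t out[16])
{
    assert(channel >= 0 && channel < 3);

    for (int j = 0; j < 16; j++)
        out[j] = 0;

    for (int k = 0; k < 3; k++) {
        uint8_t mask[16], part[16];

        /* Output lane j comes from byte 3*j + channel of the 48-byte input. */
        for (int j = 0; j < 16; j++) {
            int idx = 3 * j + channel;
            mask[j] = (idx / 16 == k) ? (uint8_t)(idx % 16) : 0x80;
        }

        shuffle_bytes(in + 16 * k, mask, part);

        for (int j = 0; j < 16; j++)
            out[j] |= part[j];
    }
}
```

In vector code the nine masks (three per stream) would be compile-time constants, so a group of size 3 would cost three loads plus nine shuffles and six ORs per 48 bytes under these assumptions.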
