I am working on a port for the TriMedia processor family, and I was playing around with the following example (extracted from gcc.c-torture/execute/simd-2.c) to see how well our port takes advantage of the addv2hi3 operator of our tm3271 processor.
test.c:
...
typedef short __attribute__((vector_size (N))) vecint;
vecint i, j, k;
void f ()
{
  k = i + j;
}
...

test.c.016veclower, N=4. This looks good, the addv2hi3 has been used once.
...
  vector short int k.2;
  vector short int j.1;
  vector short int i.0;

  i.0 = i;
  j.1 = j;
  k.2 = i.0 + j.1;
  k = k.2;
}
...

test.c.016veclower, N=32. This also looks good, the addv2hi3 has been used 8x.
...
  vector short unsigned int D.1445;
  vector short unsigned int D.1444;
  vector short unsigned int D.1443;
  vector short unsigned int D.1442;
  vector short unsigned int D.1441;
  vector short unsigned int D.1440;
  vector short unsigned int D.1439;
  vector short unsigned int D.1438;
  vector short unsigned int D.1437;
  vector short unsigned int D.1436;
  vector short unsigned int D.1435;
  vector short unsigned int D.1434;
  vector short unsigned int D.1433;
  vector short unsigned int D.1432;
  vector short unsigned int D.1431;
  vector short unsigned int D.1430;
  vector short unsigned int D.1429;
  vector short unsigned int D.1428;
  vector short unsigned int D.1427;
  vector short unsigned int D.1426;
  vector short unsigned int D.1425;
  vector short unsigned int D.1424;
  vector short unsigned int D.1423;
  vector short unsigned int D.1422;
  vector short int k.2;
  vector short int j.1;
  vector short int i.0;

  i.0 = i;
  j.1 = j;
  D.1422 = BIT_FIELD_REF <i.0, 32, 0>;
  D.1423 = BIT_FIELD_REF <j.1, 32, 0>;
  D.1424 = D.1422 + D.1423;
  D.1425 = BIT_FIELD_REF <i.0, 32, 32>;
  D.1426 = BIT_FIELD_REF <j.1, 32, 32>;
  D.1427 = D.1425 + D.1426;
  D.1428 = BIT_FIELD_REF <i.0, 32, 64>;
  D.1429 = BIT_FIELD_REF <j.1, 32, 64>;
  D.1430 = D.1428 + D.1429;
  D.1431 = BIT_FIELD_REF <i.0, 32, 96>;
  D.1432 = BIT_FIELD_REF <j.1, 32, 96>;
  D.1433 = D.1431 + D.1432;
  D.1434 = BIT_FIELD_REF <i.0, 32, 128>;
  D.1435 = BIT_FIELD_REF <j.1, 32, 128>;
  D.1436 = D.1434 + D.1435;
  D.1437 = BIT_FIELD_REF <i.0, 32, 160>;
  D.1438 = BIT_FIELD_REF <j.1, 32, 160>;
  D.1439 = D.1437 + D.1438;
  D.1440 = BIT_FIELD_REF <i.0, 32, 192>;
  D.1441 = BIT_FIELD_REF <j.1, 32, 192>;
  D.1442 = D.1440 + D.1441;
  D.1443 = BIT_FIELD_REF <i.0, 32, 224>;
  D.1444 = BIT_FIELD_REF <j.1, 32, 224>;
  D.1445 = D.1443 + D.1444;
  k.2 = {D.1424, D.1427, D.1430, D.1433, D.1436, D.1439, D.1442, D.1445};
  k = k.2;
...

test.c.016veclower, N=8. This does not look good. The addv2hi3 has not been used. The addsi3 has been used 4 times, while 2 uses of the addv2hi3 would have been enough.
...
  short int D.1431;
  short int D.1430;
  short int D.1429;
  short int D.1428;
  short int D.1427;
  short int D.1426;
  short int D.1425;
  short int D.1424;
  short int D.1423;
  short int D.1422;
  short int D.1421;
  short int D.1420;
  vector short int k.2;
  vector short int j.1;
  vector short int i.0;

  i.0 = i;
  j.1 = j;
  D.1420 = BIT_FIELD_REF <i.0, 16, 0>;
  D.1421 = BIT_FIELD_REF <j.1, 16, 0>;
  D.1422 = D.1420 + D.1421;
  D.1423 = BIT_FIELD_REF <i.0, 16, 16>;
  D.1424 = BIT_FIELD_REF <j.1, 16, 16>;
  D.1425 = D.1423 + D.1424;
  D.1426 = BIT_FIELD_REF <i.0, 16, 32>;
  D.1427 = BIT_FIELD_REF <j.1, 16, 32>;
  D.1428 = D.1426 + D.1427;
  D.1429 = BIT_FIELD_REF <i.0, 16, 48>;
  D.1430 = BIT_FIELD_REF <j.1, 16, 48>;
  D.1431 = D.1429 + D.1430;
  k.2 = {D.1422, D.1425, D.1428, D.1431};
  k = k.2;
...

This grep illustrates that the problem only occurs for N=8/16 (with addv2hi3, which adds 2 shorts at a time, the expected counts would be 1 2 4 8 16):
...
$ for N in 4 8 16 32 64; do \
    rm -f *.c.* ; \
    cc1 test.c -quiet -march=tm3271 -O2 -DN=${N} \
      -fdump-rtl-all -fdump-tree-all \
    && grep -c '+' test.c.016t.veclower ; \
  done
1
4
8
8
16
...

So why does the problem occur? Let's look at the TYPE_MODE (type) in expand_vector_operations_1() for different values of N:
...
N=4   V2HI
N=8   DImode
N=16  TImode
N=32  BLKmode
N=64  BLKmode
...

For DImode and TImode, we don't generate efficient code, due to the test on BLKmode:
...
  /* For very wide vectors, try using a smaller vector mode.  */
  compute_type = type;
  if (TYPE_MODE (type) == BLKmode && op)
...
in expand_vector_operations_1(). For my target, which has a native addv2hi3 operator, DImode/TImode can also be considered a 'wide vector'. Using this patch, I also generate addv2hi3 for N=8/N=16:
...
Index: tree-vect-generic.c
===================================================================
--- tree-vect-generic.c (revision 14)
+++ tree-vect-generic.c (working copy)
@@ -462,7 +462,7 @@
   /* For very wide vectors, try using a smaller vector mode.  */
   compute_type = type;
-  if (TYPE_MODE (type) == BLKmode && op)
+  if (op)
     {
       tree vector_compute_type
         = type_for_widest_vector_mode (TYPE_MODE (TREE_TYPE (type)), op,
...

Furthermore, I think this patch (in the style of expmed.c:extract_bit_field_1()) could be useful:
...
Index: tree-vect-generic.c
===================================================================
--- tree-vect-generic.c (revision 14)
+++ tree-vect-generic.c (working copy)
@@ -35,6 +35,7 @@
 #include "tree-pass.h"
 #include "flags.h"
 #include "ggc.h"
+#include "target.h"

 /* Build a constant of type TYPE, made of VALUE's bits replicated
@@ -369,6 +370,7 @@
   for (; mode != VOIDmode; mode = GET_MODE_WIDER_MODE (mode))
     if (GET_MODE_INNER (mode) == inner_mode
         && GET_MODE_NUNITS (mode) > best_nunits
+        && targetm.vector_mode_supported_p(mode)
         && optab_handler (op, mode)->insn_code != CODE_FOR_nothing)
       best_mode = mode, best_nunits = GET_MODE_NUNITS (mode);
...
It automatically disables an addv4hi3 if v4hi is disabled in TARGET_VECTOR_MODE_SUPPORTED_P.

--
           Summary: efficiency problem with V2HI add
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: Tom dot de dot Vries at Nxp dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40589
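
For illustration only, here is a minimal source-level sketch of the two lowerings discussed in this report for N=8. It is not taken from GCC and does not model the veclower pass itself. It only shows that adding an 8-byte vector of shorts as two V2HI chunks gives the same result as the four scalar adds seen in the N=8 dump. The type and function names (v4hi_t, v2hi_t, add_whole, add_by_v2hi, add_by_scalar, check_lowerings) are hypothetical, and the code assumes a compiler that supports the GCC vector extensions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical type names, built with the GCC vector extension. */
typedef short v4hi_t __attribute__ ((vector_size (8)));  /* N=8: four shorts */
typedef short v2hi_t __attribute__ ((vector_size (4)));  /* V2HI: two shorts */

/* Reference: one whole-vector add; the compiler lowers it as it sees fit. */
static v4hi_t add_whole (v4hi_t a, v4hi_t b)
{
  return a + b;
}

/* Desired lowering for N=8: two V2HI adds, one per 32-bit chunk. */
static v4hi_t add_by_v2hi (v4hi_t a, v4hi_t b)
{
  v2hi_t ah[2], bh[2], rh[2];
  v4hi_t r;
  int c;

  memcpy (ah, &a, sizeof a);
  memcpy (bh, &b, sizeof b);
  for (c = 0; c < 2; c++)
    rh[c] = ah[c] + bh[c];          /* one V2HI add per chunk */
  memcpy (&r, rh, sizeof r);
  return r;
}

/* Lowering seen in the N=8 dump: four adds, one per 16-bit element. */
static v4hi_t add_by_scalar (v4hi_t a, v4hi_t b)
{
  short as[4], bs[4], rs[4];
  v4hi_t r;
  int e;

  memcpy (as, &a, sizeof a);
  memcpy (bs, &b, sizeof b);
  for (e = 0; e < 4; e++)
    rs[e] = (short) (as[e] + bs[e]);  /* one scalar add per element */
  memcpy (&r, rs, sizeof r);
  return r;
}

/* Asserts that all three ways of adding produce the same bytes. */
void check_lowerings (v4hi_t a, v4hi_t b)
{
  v4hi_t w = add_whole (a, b);
  v4hi_t v = add_by_v2hi (a, b);
  v4hi_t s = add_by_scalar (a, b);

  assert (memcmp (&w, &v, sizeof w) == 0);
  assert (memcmp (&w, &s, sizeof w) == 0);
}
```

Calling check_lowerings() with two small vectors, e.g. {1, 2, 3, 4} and {10, 20, 30, 40}, exercises all three variants. The chunked variant issues 2 adds where the scalar variant issues 4, which is the difference the grep counts above make visible.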