On Fri, Mar 22, 2019 at 11:11:58AM +0100, Uros Bizjak wrote:
> > For FMA, naturally only the two operands that are multiplied should be
> > commutative, but in most patterns one of those two uses "0" or "0,0"
>
> This should be safe, we have had "*add<mode>_1" for decades that does
> just the above.
Sure, the 0 isn't a problem in itself.

> > constraint and there is one or two match_dup 1 for it, so it really
> > isn't commutative.
>
> Hm, this situation involving match_dup needs some more thinking...

But this one is.  If one reads the documentation
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_mask_fmadd_sd&expand=5236,2545
or
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_maskz_fmadd_sd&expand=5236,2545,2547
then even in the source-form part of the description the a and b arguments
aren't commutative, because a is used 3 or 2 times while b is used just once.
Compare to
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_mask3_fmadd_sd&expand=5236,2545,2547,2546
where a and b are both used just once and which thus can be commutative;
only c is used 3 times.

For this reason, even _mm{,256,512}_mask_f{,n}m{add,sub}_p{s,d} aren't using
% and IMHO can't.  Now, _mm{,256,512}_maskz_f{,n}m{add,sub}_p{s,d} actually
do use % and can, because both a and b are used just once; each element is
filled with zeros if the corresponding mask bit is 0.  That is different from
_mm_maskz_f{,n}m{add,sub}_s{s,d}, which fills only the first element with 0,
while the elements above it are copied from a.  (A rough C model of these
scalar forms follows the ChangeLog below.)

> > Which leaves us with the 4 mask3 patterns only, as I said above, for
> > the first two where neither of those are negated I think % should be ok.
> > For the .*fnm{add,sub}.*mask3.* ones I'm not sure, because one of them
> > is negated.  On the other side, seems various other existing fnm*
> > patterns use % even on those.
>
> It is safe to use even if one of the first two operands is negated.
> According to the documentation, the negation represents negation of
> the intermediate product, so it doesn't matter which operand is
> negated.

So like this if it passes bootstrap/regtest?

P.S., it would be nice to have some testsuite coverage also for cases where
the intrinsics are called from noipa wrapper functions and one of the
arguments is __m128{,d} *x and passes *x to the intrinsic (repeated so that
we test all non-mask arguments that way).  And it would be nice to also have
testcases with constant -1 masks and with a constant 0 mask.  I just compiled
the attached sources and eyeballed the result; at least in some cases it
performed the expected simplifications, but I probably don't have spare
cycles now to turn all that into something suitable for the testsuite
(ideally it would be both a test for specific instructions and a runtime
test).

2019-03-22  Jakub Jelinek  <ja...@redhat.com>

	* config/i386/sse.md (<avx512>_fmadd_<mode>_mask3<round_name>,
	<avx512>_fmsub_<mode>_mask3<round_name>,
	<avx512>_fnmadd_<mode>_mask3<round_name>,
	<avx512>_fnmsub_<mode>_mask3<round_name>,
	avx512f_vmfmadd_<mode>_mask3<round_name>,
	avx512f_vmfmsub_<mode>_mask3<round_name>,
	*avx512f_vmfnmadd_<mode>_mask3<round_name>): Use
	<round_nimm_predicate> instead of register_operand and %v instead of
	v for match_operand 1.
	(avx512f_vmfnmsub_<mode>_mask3<round_name>): Rename to ...
	(*avx512f_vmfnmsub_<mode>_mask3<round_name>): ... this.  Use
	<round_nimm_predicate> instead of register_operand and %v instead of
	v for match_operand 1.
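To make the operand-use counts above concrete, here is a rough C model of the
three scalar fmadd_sd forms (just an illustration of the data flow, not the
Intel pseudo-code; the model_* helpers are made up, GCC's vector subscripting
stands in for the element selects, and the single rounding of the real fma is
ignored):

#include <x86intrin.h>

/* _mm_mask_fmadd_sd: a is used three times, as multiplicand, as the
   masked-off fallback for element 0 and as the source of the upper
   element, so a and b cannot be swapped.  */
static __m128d
model_mask_fmadd_sd (__m128d a, __mmask8 m, __m128d b, __m128d c)
{
  a[0] = (m & 1) ? a[0] * b[0] + c[0] : a[0];
  return a;
}

/* _mm_mask3_fmadd_sd: a and b are each used just once, so they really
   are interchangeable; c is the operand used three times.  */
static __m128d
model_mask3_fmadd_sd (__m128d a, __m128d b, __m128d c, __mmask8 m)
{
  c[0] = (m & 1) ? a[0] * b[0] + c[0] : c[0];
  return c;
}

/* _mm_maskz_fmadd_sd: element 0 is zeroed when the mask bit is clear,
   but the upper element still comes from a, so a is used twice and b
   only once.  */
static __m128d
model_maskz_fmadd_sd (__mmask8 m, __m128d a, __m128d b, __m128d c)
{
  a[0] = (m & 1) ? a[0] * b[0] + c[0] : 0.0;
  return a;
}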
--- gcc/config/i386/sse.md.jj	2019-03-22 11:11:58.330060594 +0100
+++ gcc/config/i386/sse.md	2019-03-22 11:21:12.901952453 +0100
@@ -3973,7 +3973,7 @@ (define_insn "<avx512>_fmadd_<mode>_mask
   [(set (match_operand:VF_AVX512VL 0 "register_operand" "=v")
 	(vec_merge:VF_AVX512VL
 	  (fma:VF_AVX512VL
-	    (match_operand:VF_AVX512VL 1 "register_operand" "v")
+	    (match_operand:VF_AVX512VL 1 "<round_nimm_predicate>" "%v")
 	    (match_operand:VF_AVX512VL 2 "<round_nimm_predicate>" "<round_constraint>")
 	    (match_operand:VF_AVX512VL 3 "register_operand" "0"))
 	  (match_dup 3)
@@ -4094,7 +4094,7 @@ (define_insn "<avx512>_fmsub_<mode>_mask
   [(set (match_operand:VF_AVX512VL 0 "register_operand" "=v")
 	(vec_merge:VF_AVX512VL
 	  (fma:VF_AVX512VL
-	    (match_operand:VF_AVX512VL 1 "register_operand" "v")
+	    (match_operand:VF_AVX512VL 1 "<round_nimm_predicate>" "%v")
 	    (match_operand:VF_AVX512VL 2 "<round_nimm_predicate>" "<round_constraint>")
 	    (neg:VF_AVX512VL
 	      (match_operand:VF_AVX512VL 3 "register_operand" "0")))
@@ -4217,7 +4217,7 @@ (define_insn "<avx512>_fnmadd_<mode>_mas
 	(vec_merge:VF_AVX512VL
 	  (fma:VF_AVX512VL
 	    (neg:VF_AVX512VL
-	      (match_operand:VF_AVX512VL 1 "register_operand" "v"))
+	      (match_operand:VF_AVX512VL 1 "<round_nimm_predicate>" "%v"))
 	    (match_operand:VF_AVX512VL 2 "<round_nimm_predicate>" "<round_constraint>")
 	    (match_operand:VF_AVX512VL 3 "register_operand" "0"))
 	  (match_dup 3)
@@ -4345,7 +4345,7 @@ (define_insn "<avx512>_fnmsub_<mode>_mas
 	(vec_merge:VF_AVX512VL
 	  (fma:VF_AVX512VL
 	    (neg:VF_AVX512VL
-	      (match_operand:VF_AVX512VL 1 "register_operand" "v"))
+	      (match_operand:VF_AVX512VL 1 "<round_nimm_predicate>" "%v"))
 	    (match_operand:VF_AVX512VL 2 "<round_nimm_predicate>" "<round_constraint>")
 	    (neg:VF_AVX512VL
 	      (match_operand:VF_AVX512VL 3 "register_operand" "0")))
@@ -4667,7 +4667,7 @@ (define_insn "avx512f_vmfmadd_<mode>_mas
 	(vec_merge:VF_128
 	  (vec_merge:VF_128
 	    (fma:VF_128
-	      (match_operand:VF_128 1 "register_operand" "v")
+	      (match_operand:VF_128 1 "<round_nimm_predicate>" "%v")
 	      (match_operand:VF_128 2 "<round_nimm_predicate>" "<round_constraint>")
 	      (match_operand:VF_128 3 "register_operand" "0"))
 	    (match_dup 3)
@@ -4737,7 +4737,7 @@ (define_insn "avx512f_vmfmsub_<mode>_mas
 	(vec_merge:VF_128
 	  (vec_merge:VF_128
 	    (fma:VF_128
-	      (match_operand:VF_128 1 "register_operand" "v")
+	      (match_operand:VF_128 1 "<round_nimm_predicate>" "%v")
 	      (match_operand:VF_128 2 "<round_nimm_predicate>" "<round_constraint>")
 	      (neg:VF_128
 		(match_operand:VF_128 3 "register_operand" "0")))
@@ -4797,7 +4797,7 @@ (define_insn "*avx512f_vmfnmadd_<mode>_m
 	    (fma:VF_128
 	      (neg:VF_128
 		(match_operand:VF_128 2 "<round_nimm_predicate>" "<round_constraint>"))
-	      (match_operand:VF_128 1 "register_operand" "v")
+	      (match_operand:VF_128 1 "<round_nimm_predicate>" "%v")
 	      (match_operand:VF_128 3 "register_operand" "0"))
 	    (match_dup 3)
 	    (match_operand:QI 4 "register_operand" "Yk"))
@@ -4849,14 +4849,14 @@ (define_insn "*avx512f_vmfnmsub_<mode>_m
   [(set_attr "type" "ssemuladd")
    (set_attr "mode" "<MODE>")])
 
-(define_insn "avx512f_vmfnmsub_<mode>_mask3<round_name>"
+(define_insn "*avx512f_vmfnmsub_<mode>_mask3<round_name>"
   [(set (match_operand:VF_128 0 "register_operand" "=v")
 	(vec_merge:VF_128
 	  (vec_merge:VF_128
 	    (fma:VF_128
 	      (neg:VF_128
 		(match_operand:VF_128 2 "<round_nimm_predicate>" "<round_constraint>"))
-	      (match_operand:VF_128 1 "register_operand" "v")
+	      (match_operand:VF_128 1 "<round_nimm_predicate>" "%v")
 	      (neg:VF_128
 		(match_operand:VF_128 3 "register_operand" "0")))
 	    (match_dup 3)

	Jakub
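As an example of the noipa wrapper coverage mentioned in the P.S. above (only
a sketch of the idea, not part of the attached sources; the wrap_* names are
made up), one non-mask argument is passed through a pointer so that the
<round_constraint> memory alternative of the insn has a chance to be
exercised:

#include <x86intrin.h>

__attribute__((noipa)) __m128d
wrap_mask_fmadd_sd_b (__m128d a, __mmask8 m, __m128d *b, __m128d c)
{
  /* The b operand is loaded through a pointer, so it can end up as the
     memory operand of the fma.  */
  return _mm_mask_fmadd_sd (a, m, *b, c);
}

__attribute__((noipa)) __m128d
wrap_mask_fmadd_sd_c (__m128d a, __mmask8 m, __m128d b, __m128d *c)
{
  /* The same again, this time with c dereferenced instead.  */
  return _mm_mask_fmadd_sd (a, m, b, *c);
}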
#include <x86intrin.h>

__m128d f1 (__m128d a, __mmask8 m, __m128d b, __m128d c) { return _mm_mask_fmadd_sd (a, m, b, c); }
__m128d f2 (__m128d a, __m128d b, __m128d c, __mmask8 m) { return _mm_mask3_fmadd_sd (a, b, c, m); }
__m128d f3 (__mmask8 m, __m128d a, __m128d b, __m128d c) { return _mm_maskz_fmadd_sd (m, a, b, c); }
__m128d f4 (__m128d a, __mmask8 m, __m128d b, __m128d c) { return _mm_mask_fmadd_round_sd (a, m, b, c, 9); }
__m128d f5 (__m128d a, __m128d b, __m128d c, __mmask8 m) { return _mm_mask3_fmadd_round_sd (a, b, c, m, 10); }
__m128d f6 (__mmask8 m, __m128d a, __m128d b, __m128d c) { return _mm_maskz_fmadd_round_sd (m, a, b, c, 11); }
__m128 f7 (__m128 a, __mmask8 m, __m128 b, __m128 c) { return _mm_mask_fmadd_ss (a, m, b, c); }
__m128 f8 (__m128 a, __m128 b, __m128 c, __mmask8 m) { return _mm_mask3_fmadd_ss (a, b, c, m); }
__m128 f9 (__mmask8 m, __m128 a, __m128 b, __m128 c) { return _mm_maskz_fmadd_ss (m, a, b, c); }
__m128 f10 (__m128 a, __mmask8 m, __m128 b, __m128 c) { return _mm_mask_fmadd_round_ss (a, m, b, c, 9); }
__m128 f11 (__m128 a, __m128 b, __m128 c, __mmask8 m) { return _mm_mask3_fmadd_round_ss (a, b, c, m, 10); }
__m128 f12 (__mmask8 m, __m128 a, __m128 b, __m128 c) { return _mm_maskz_fmadd_round_ss (m, a, b, c, 11); }
__m128d f13 (__m128d a, __mmask8 m, __m128d b, __m128d c) { return _mm_mask_fmsub_sd (a, m, b, c); }
__m128d f14 (__m128d a, __m128d b, __m128d c, __mmask8 m) { return _mm_mask3_fmsub_sd (a, b, c, m); }
__m128d f15 (__mmask8 m, __m128d a, __m128d b, __m128d c) { return _mm_maskz_fmsub_sd (m, a, b, c); }
__m128d f16 (__m128d a, __mmask8 m, __m128d b, __m128d c) { return _mm_mask_fmsub_round_sd (a, m, b, c, 9); }
__m128d f17 (__m128d a, __m128d b, __m128d c, __mmask8 m) { return _mm_mask3_fmsub_round_sd (a, b, c, m, 10); }
__m128d f18 (__mmask8 m, __m128d a, __m128d b, __m128d c) { return _mm_maskz_fmsub_round_sd (m, a, b, c, 11); }
__m128 f19 (__m128 a, __mmask8 m, __m128 b, __m128 c) { return _mm_mask_fmsub_ss (a, m, b, c); }
__m128 f20 (__m128 a, __m128 b, __m128 c, __mmask8 m) { return _mm_mask3_fmsub_ss (a, b, c, m); }
__m128 f21 (__mmask8 m, __m128 a, __m128 b, __m128 c) { return _mm_maskz_fmsub_ss (m, a, b, c); }
__m128 f22 (__m128 a, __mmask8 m, __m128 b, __m128 c) { return _mm_mask_fmsub_round_ss (a, m, b, c, 9); }
__m128 f23 (__m128 a, __m128 b, __m128 c, __mmask8 m) { return _mm_mask3_fmsub_round_ss (a, b, c, m, 10); }
__m128 f24 (__mmask8 m, __m128 a, __m128 b, __m128 c) { return _mm_maskz_fmsub_round_ss (m, a, b, c, 11); }
__m128d f25 (__m128d a, __mmask8 m, __m128d b, __m128d c) { return _mm_mask_fnmadd_sd (a, m, b, c); }
__m128d f26 (__m128d a, __m128d b, __m128d c, __mmask8 m) { return _mm_mask3_fnmadd_sd (a, b, c, m); }
__m128d f27 (__mmask8 m, __m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmadd_sd (m, a, b, c); }
__m128d f28 (__m128d a, __mmask8 m, __m128d b, __m128d c) { return _mm_mask_fnmadd_round_sd (a, m, b, c, 9); }
__m128d f29 (__m128d a, __m128d b, __m128d c, __mmask8 m) { return _mm_mask3_fnmadd_round_sd (a, b, c, m, 10); }
__m128d f30 (__mmask8 m, __m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmadd_round_sd (m, a, b, c, 11); }
__m128 f31 (__m128 a, __mmask8 m, __m128 b, __m128 c) { return _mm_mask_fnmadd_ss (a, m, b, c); }
__m128 f32 (__m128 a, __m128 b, __m128 c, __mmask8 m) { return _mm_mask3_fnmadd_ss (a, b, c, m); }
__m128 f33 (__mmask8 m, __m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmadd_ss (m, a, b, c); }
__m128 f34 (__m128 a, __mmask8 m, __m128 b, __m128 c) { return _mm_mask_fnmadd_round_ss (a, m, b, c, 9); }
__m128 f35 (__m128 a, __m128 b, __m128 c, __mmask8 m) { return _mm_mask3_fnmadd_round_ss (a, b, c, m, 10); }
__m128 f36 (__mmask8 m, __m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmadd_round_ss (m, a, b, c, 11); }
__m128d f37 (__m128d a, __mmask8 m, __m128d b, __m128d c) { return _mm_mask_fnmsub_sd (a, m, b, c); }
__m128d f38 (__m128d a, __m128d b, __m128d c, __mmask8 m) { return _mm_mask3_fnmsub_sd (a, b, c, m); }
__m128d f39 (__mmask8 m, __m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmsub_sd (m, a, b, c); }
__m128d f40 (__m128d a, __mmask8 m, __m128d b, __m128d c) { return _mm_mask_fnmsub_round_sd (a, m, b, c, 9); }
__m128d f41 (__m128d a, __m128d b, __m128d c, __mmask8 m) { return _mm_mask3_fnmsub_round_sd (a, b, c, m, 10); }
__m128d f42 (__mmask8 m, __m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmsub_round_sd (m, a, b, c, 11); }
__m128 f43 (__m128 a, __mmask8 m, __m128 b, __m128 c) { return _mm_mask_fnmsub_ss (a, m, b, c); }
__m128 f44 (__m128 a, __m128 b, __m128 c, __mmask8 m) { return _mm_mask3_fnmsub_ss (a, b, c, m); }
__m128 f45 (__mmask8 m, __m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmsub_ss (m, a, b, c); }
__m128 f46 (__m128 a, __mmask8 m, __m128 b, __m128 c) { return _mm_mask_fnmsub_round_ss (a, m, b, c, 9); }
__m128 f47 (__m128 a, __m128 b, __m128 c, __mmask8 m) { return _mm_mask3_fnmsub_round_ss (a, b, c, m, 10); }
__m128 f48 (__mmask8 m, __m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmsub_round_ss (m, a, b, c, 11); }
#include <x86intrin.h>

__m128d f1 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fmadd_sd (a, -1, b, c); }
__m128d f2 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fmadd_sd (a, b, c, -1); }
__m128d f3 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fmadd_sd (-1, a, b, c); }
__m128d f4 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fmadd_round_sd (a, -1, b, c, 9); }
__m128d f5 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fmadd_round_sd (a, b, c, -1, 10); }
__m128d f6 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fmadd_round_sd (-1, a, b, c, 11); }
__m128 f7 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fmadd_ss (a, -1, b, c); }
__m128 f8 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fmadd_ss (a, b, c, -1); }
__m128 f9 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fmadd_ss (-1, a, b, c); }
__m128 f10 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fmadd_round_ss (a, -1, b, c, 9); }
__m128 f11 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fmadd_round_ss (a, b, c, -1, 10); }
__m128 f12 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fmadd_round_ss (-1, a, b, c, 11); }
__m128d f13 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fmsub_sd (a, -1, b, c); }
__m128d f14 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fmsub_sd (a, b, c, -1); }
__m128d f15 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fmsub_sd (-1, a, b, c); }
__m128d f16 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fmsub_round_sd (a, -1, b, c, 9); }
__m128d f17 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fmsub_round_sd (a, b, c, -1, 10); }
__m128d f18 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fmsub_round_sd (-1, a, b, c, 11); }
__m128 f19 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fmsub_ss (a, -1, b, c); }
__m128 f20 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fmsub_ss (a, b, c, -1); }
__m128 f21 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fmsub_ss (-1, a, b, c); }
__m128 f22 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fmsub_round_ss (a, -1, b, c, 9); }
__m128 f23 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fmsub_round_ss (a, b, c, -1, 10); }
__m128 f24 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fmsub_round_ss (-1, a, b, c, 11); }
__m128d f25 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fnmadd_sd (a, -1, b, c); }
__m128d f26 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fnmadd_sd (a, b, c, -1); }
__m128d f27 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmadd_sd (-1, a, b, c); }
__m128d f28 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fnmadd_round_sd (a, -1, b, c, 9); }
__m128d f29 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fnmadd_round_sd (a, b, c, -1, 10); }
__m128d f30 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmadd_round_sd (-1, a, b, c, 11); }
__m128 f31 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fnmadd_ss (a, -1, b, c); }
__m128 f32 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fnmadd_ss (a, b, c, -1); }
__m128 f33 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmadd_ss (-1, a, b, c); }
__m128 f34 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fnmadd_round_ss (a, -1, b, c, 9); }
__m128 f35 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fnmadd_round_ss (a, b, c, -1, 10); }
__m128 f36 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmadd_round_ss (-1, a, b, c, 11); }
__m128d f37 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fnmsub_sd (a, -1, b, c); }
__m128d f38 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fnmsub_sd (a, b, c, -1); }
__m128d f39 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmsub_sd (-1, a, b, c); }
__m128d f40 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fnmsub_round_sd (a, -1, b, c, 9); }
__m128d f41 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fnmsub_round_sd (a, b, c, -1, 10); }
__m128d f42 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmsub_round_sd (-1, a, b, c, 11); }
__m128 f43 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fnmsub_ss (a, -1, b, c); }
__m128 f44 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fnmsub_ss (a, b, c, -1); }
__m128 f45 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmsub_ss (-1, a, b, c); }
__m128 f46 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fnmsub_round_ss (a, -1, b, c, 9); }
__m128 f47 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fnmsub_round_ss (a, b, c, -1, 10); }
__m128 f48 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmsub_round_ss (-1, a, b, c, 11); }
#include <x86intrin.h>

__m128d f1 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fmadd_sd (a, 0, b, c); }
__m128d f2 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fmadd_sd (a, b, c, 0); }
__m128d f3 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fmadd_sd (0, a, b, c); }
__m128d f4 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fmadd_round_sd (a, 0, b, c, 9); }
__m128d f5 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fmadd_round_sd (a, b, c, 0, 10); }
__m128d f6 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fmadd_round_sd (0, a, b, c, 11); }
__m128 f7 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fmadd_ss (a, 0, b, c); }
__m128 f8 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fmadd_ss (a, b, c, 0); }
__m128 f9 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fmadd_ss (0, a, b, c); }
__m128 f10 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fmadd_round_ss (a, 0, b, c, 9); }
__m128 f11 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fmadd_round_ss (a, b, c, 0, 10); }
__m128 f12 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fmadd_round_ss (0, a, b, c, 11); }
__m128d f13 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fmsub_sd (a, 0, b, c); }
__m128d f14 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fmsub_sd (a, b, c, 0); }
__m128d f15 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fmsub_sd (0, a, b, c); }
__m128d f16 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fmsub_round_sd (a, 0, b, c, 9); }
__m128d f17 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fmsub_round_sd (a, b, c, 0, 10); }
__m128d f18 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fmsub_round_sd (0, a, b, c, 11); }
__m128 f19 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fmsub_ss (a, 0, b, c); }
__m128 f20 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fmsub_ss (a, b, c, 0); }
__m128 f21 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fmsub_ss (0, a, b, c); }
__m128 f22 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fmsub_round_ss (a, 0, b, c, 9); }
__m128 f23 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fmsub_round_ss (a, b, c, 0, 10); }
__m128 f24 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fmsub_round_ss (0, a, b, c, 11); }
__m128d f25 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fnmadd_sd (a, 0, b, c); }
__m128d f26 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fnmadd_sd (a, b, c, 0); }
__m128d f27 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmadd_sd (0, a, b, c); }
__m128d f28 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fnmadd_round_sd (a, 0, b, c, 9); }
__m128d f29 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fnmadd_round_sd (a, b, c, 0, 10); }
__m128d f30 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmadd_round_sd (0, a, b, c, 11); }
__m128 f31 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fnmadd_ss (a, 0, b, c); }
__m128 f32 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fnmadd_ss (a, b, c, 0); }
__m128 f33 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmadd_ss (0, a, b, c); }
__m128 f34 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fnmadd_round_ss (a, 0, b, c, 9); }
__m128 f35 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fnmadd_round_ss (a, b, c, 0, 10); }
__m128 f36 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmadd_round_ss (0, a, b, c, 11); }
__m128d f37 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fnmsub_sd (a, 0, b, c); }
__m128d f38 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fnmsub_sd (a, b, c, 0); }
__m128d f39 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmsub_sd (0, a, b, c); }
__m128d f40 (__m128d a, __m128d b, __m128d c) { return _mm_mask_fnmsub_round_sd (a, 0, b, c, 9); }
__m128d f41 (__m128d a, __m128d b, __m128d c) { return _mm_mask3_fnmsub_round_sd (a, b, c, 0, 10); }
__m128d f42 (__m128d a, __m128d b, __m128d c) { return _mm_maskz_fnmsub_round_sd (0, a, b, c, 11); }
__m128 f43 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fnmsub_ss (a, 0, b, c); }
__m128 f44 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fnmsub_ss (a, b, c, 0); }
__m128 f45 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmsub_ss (0, a, b, c); }
__m128 f46 (__m128 a, __m128 b, __m128 c) { return _mm_mask_fnmsub_round_ss (a, 0, b, c, 9); }
__m128 f47 (__m128 a, __m128 b, __m128 c) { return _mm_mask3_fnmsub_round_ss (a, b, c, 0, 10); }
__m128 f48 (__m128 a, __m128 b, __m128 c) { return _mm_maskz_fnmsub_round_ss (0, a, b, c, 11); }
#include <x86intrin.h>

__m128d a, b, c, d;
__m128 e, f, g, h;
__mmask8 m;

void f1 (void) { d = _mm_mask_fmadd_sd (a, m, b, c); }
void f2 (void) { d = _mm_mask3_fmadd_sd (a, b, c, m); }
void f3 (void) { d = _mm_maskz_fmadd_sd (m, a, b, c); }
void f4 (void) { d = _mm_mask_fmadd_round_sd (a, m, b, c, 9); }
void f5 (void) { d = _mm_mask3_fmadd_round_sd (a, b, c, m, 10); }
void f6 (void) { d = _mm_maskz_fmadd_round_sd (m, a, b, c, 11); }
void f7 (void) { h = _mm_mask_fmadd_ss (e, m, f, g); }
void f8 (void) { h = _mm_mask3_fmadd_ss (e, f, g, m); }
void f9 (void) { h = _mm_maskz_fmadd_ss (m, e, f, g); }
void f10 (void) { h = _mm_mask_fmadd_round_ss (e, m, f, g, 9); }
void f11 (void) { h = _mm_mask3_fmadd_round_ss (e, f, g, m, 10); }
void f12 (void) { h = _mm_maskz_fmadd_round_ss (m, e, f, g, 11); }
void f13 (void) { d = _mm_mask_fmsub_sd (a, m, b, c); }
void f14 (void) { d = _mm_mask3_fmsub_sd (a, b, c, m); }
void f15 (void) { d = _mm_maskz_fmsub_sd (m, a, b, c); }
void f16 (void) { d = _mm_mask_fmsub_round_sd (a, m, b, c, 9); }
void f17 (void) { d = _mm_mask3_fmsub_round_sd (a, b, c, m, 10); }
void f18 (void) { d = _mm_maskz_fmsub_round_sd (m, a, b, c, 11); }
void f19 (void) { h = _mm_mask_fmsub_ss (e, m, f, g); }
void f20 (void) { h = _mm_mask3_fmsub_ss (e, f, g, m); }
void f21 (void) { h = _mm_maskz_fmsub_ss (m, e, f, g); }
void f22 (void) { h = _mm_mask_fmsub_round_ss (e, m, f, g, 9); }
void f23 (void) { h = _mm_mask3_fmsub_round_ss (e, f, g, m, 10); }
void f24 (void) { h = _mm_maskz_fmsub_round_ss (m, e, f, g, 11); }
void f25 (void) { d = _mm_mask_fnmadd_sd (a, m, b, c); }
void f26 (void) { d = _mm_mask3_fnmadd_sd (a, b, c, m); }
void f27 (void) { d = _mm_maskz_fnmadd_sd (m, a, b, c); }
void f28 (void) { d = _mm_mask_fnmadd_round_sd (a, m, b, c, 9); }
void f29 (void) { d = _mm_mask3_fnmadd_round_sd (a, b, c, m, 10); }
void f30 (void) { d = _mm_maskz_fnmadd_round_sd (m, a, b, c, 11); }
void f31 (void) { h = _mm_mask_fnmadd_ss (e, m, f, g); }
void f32 (void) { h = _mm_mask3_fnmadd_ss (e, f, g, m); }
void f33 (void) { h = _mm_maskz_fnmadd_ss (m, e, f, g); }
void f34 (void) { h = _mm_mask_fnmadd_round_ss (e, m, f, g, 9); }
void f35 (void) { h = _mm_mask3_fnmadd_round_ss (e, f, g, m, 10); }
void f36 (void) { h = _mm_maskz_fnmadd_round_ss (m, e, f, g, 11); }
void f37 (void) { d = _mm_mask_fnmsub_sd (a, m, b, c); }
void f38 (void) { d = _mm_mask3_fnmsub_sd (a, b, c, m); }
void f39 (void) { d = _mm_maskz_fnmsub_sd (m, a, b, c); }
void f40 (void) { d = _mm_mask_fnmsub_round_sd (a, m, b, c, 9); }
void f41 (void) { d = _mm_mask3_fnmsub_round_sd (a, b, c, m, 10); }
void f42 (void) { d = _mm_maskz_fnmsub_round_sd (m, a, b, c, 11); }
void f43 (void) { h = _mm_mask_fnmsub_ss (e, m, f, g); }
void f44 (void) { h = _mm_mask3_fnmsub_ss (e, f, g, m); }
void f45 (void) { h = _mm_maskz_fnmsub_ss (m, e, f, g); }
void f46 (void) { h = _mm_mask_fnmsub_round_ss (e, m, f, g, 9); }
void f47 (void) { h = _mm_mask3_fnmsub_round_ss (e, f, g, m, 10); }
void f48 (void) { h = _mm_maskz_fnmsub_round_ss (m, e, f, g, 11); }
#include <x86intrin.h>

__m128d a, b, c, d;
__m128 e, f, g, h;

void f1 (void) { d = _mm_mask_fmadd_sd (a, -1, b, c); }
void f2 (void) { d = _mm_mask3_fmadd_sd (a, b, c, -1); }
void f3 (void) { d = _mm_maskz_fmadd_sd (-1, a, b, c); }
void f4 (void) { d = _mm_mask_fmadd_round_sd (a, -1, b, c, 9); }
void f5 (void) { d = _mm_mask3_fmadd_round_sd (a, b, c, -1, 10); }
void f6 (void) { d = _mm_maskz_fmadd_round_sd (-1, a, b, c, 11); }
void f7 (void) { h = _mm_mask_fmadd_ss (e, -1, f, g); }
void f8 (void) { h = _mm_mask3_fmadd_ss (e, f, g, -1); }
void f9 (void) { h = _mm_maskz_fmadd_ss (-1, e, f, g); }
void f10 (void) { h = _mm_mask_fmadd_round_ss (e, -1, f, g, 9); }
void f11 (void) { h = _mm_mask3_fmadd_round_ss (e, f, g, -1, 10); }
void f12 (void) { h = _mm_maskz_fmadd_round_ss (-1, e, f, g, 11); }
void f13 (void) { d = _mm_mask_fmsub_sd (a, -1, b, c); }
void f14 (void) { d = _mm_mask3_fmsub_sd (a, b, c, -1); }
void f15 (void) { d = _mm_maskz_fmsub_sd (-1, a, b, c); }
void f16 (void) { d = _mm_mask_fmsub_round_sd (a, -1, b, c, 9); }
void f17 (void) { d = _mm_mask3_fmsub_round_sd (a, b, c, -1, 10); }
void f18 (void) { d = _mm_maskz_fmsub_round_sd (-1, a, b, c, 11); }
void f19 (void) { h = _mm_mask_fmsub_ss (e, -1, f, g); }
void f20 (void) { h = _mm_mask3_fmsub_ss (e, f, g, -1); }
void f21 (void) { h = _mm_maskz_fmsub_ss (-1, e, f, g); }
void f22 (void) { h = _mm_mask_fmsub_round_ss (e, -1, f, g, 9); }
void f23 (void) { h = _mm_mask3_fmsub_round_ss (e, f, g, -1, 10); }
void f24 (void) { h = _mm_maskz_fmsub_round_ss (-1, e, f, g, 11); }
void f25 (void) { d = _mm_mask_fnmadd_sd (a, -1, b, c); }
void f26 (void) { d = _mm_mask3_fnmadd_sd (a, b, c, -1); }
void f27 (void) { d = _mm_maskz_fnmadd_sd (-1, a, b, c); }
void f28 (void) { d = _mm_mask_fnmadd_round_sd (a, -1, b, c, 9); }
void f29 (void) { d = _mm_mask3_fnmadd_round_sd (a, b, c, -1, 10); }
void f30 (void) { d = _mm_maskz_fnmadd_round_sd (-1, a, b, c, 11); }
void f31 (void) { h = _mm_mask_fnmadd_ss (e, -1, f, g); }
void f32 (void) { h = _mm_mask3_fnmadd_ss (e, f, g, -1); }
void f33 (void) { h = _mm_maskz_fnmadd_ss (-1, e, f, g); }
void f34 (void) { h = _mm_mask_fnmadd_round_ss (e, -1, f, g, 9); }
void f35 (void) { h = _mm_mask3_fnmadd_round_ss (e, f, g, -1, 10); }
void f36 (void) { h = _mm_maskz_fnmadd_round_ss (-1, e, f, g, 11); }
void f37 (void) { d = _mm_mask_fnmsub_sd (a, -1, b, c); }
void f38 (void) { d = _mm_mask3_fnmsub_sd (a, b, c, -1); }
void f39 (void) { d = _mm_maskz_fnmsub_sd (-1, a, b, c); }
void f40 (void) { d = _mm_mask_fnmsub_round_sd (a, -1, b, c, 9); }
void f41 (void) { d = _mm_mask3_fnmsub_round_sd (a, b, c, -1, 10); }
void f42 (void) { d = _mm_maskz_fnmsub_round_sd (-1, a, b, c, 11); }
void f43 (void) { h = _mm_mask_fnmsub_ss (e, -1, f, g); }
void f44 (void) { h = _mm_mask3_fnmsub_ss (e, f, g, -1); }
void f45 (void) { h = _mm_maskz_fnmsub_ss (-1, e, f, g); }
void f46 (void) { h = _mm_mask_fnmsub_round_ss (e, -1, f, g, 9); }
void f47 (void) { h = _mm_mask3_fnmsub_round_ss (e, f, g, -1, 10); }
void f48 (void) { h = _mm_maskz_fnmsub_round_ss (-1, e, f, g, 11); }
#include <x86intrin.h>

__m128d a, b, c, d;
__m128 e, f, g, h;

void f1 (void) { d = _mm_mask_fmadd_sd (a, 0, b, c); }
void f2 (void) { d = _mm_mask3_fmadd_sd (a, b, c, 0); }
void f3 (void) { d = _mm_maskz_fmadd_sd (0, a, b, c); }
void f4 (void) { d = _mm_mask_fmadd_round_sd (a, 0, b, c, 9); }
void f5 (void) { d = _mm_mask3_fmadd_round_sd (a, b, c, 0, 10); }
void f6 (void) { d = _mm_maskz_fmadd_round_sd (0, a, b, c, 11); }
void f7 (void) { h = _mm_mask_fmadd_ss (e, 0, f, g); }
void f8 (void) { h = _mm_mask3_fmadd_ss (e, f, g, 0); }
void f9 (void) { h = _mm_maskz_fmadd_ss (0, e, f, g); }
void f10 (void) { h = _mm_mask_fmadd_round_ss (e, 0, f, g, 9); }
void f11 (void) { h = _mm_mask3_fmadd_round_ss (e, f, g, 0, 10); }
void f12 (void) { h = _mm_maskz_fmadd_round_ss (0, e, f, g, 11); }
void f13 (void) { d = _mm_mask_fmsub_sd (a, 0, b, c); }
void f14 (void) { d = _mm_mask3_fmsub_sd (a, b, c, 0); }
void f15 (void) { d = _mm_maskz_fmsub_sd (0, a, b, c); }
void f16 (void) { d = _mm_mask_fmsub_round_sd (a, 0, b, c, 9); }
void f17 (void) { d = _mm_mask3_fmsub_round_sd (a, b, c, 0, 10); }
void f18 (void) { d = _mm_maskz_fmsub_round_sd (0, a, b, c, 11); }
void f19 (void) { h = _mm_mask_fmsub_ss (e, 0, f, g); }
void f20 (void) { h = _mm_mask3_fmsub_ss (e, f, g, 0); }
void f21 (void) { h = _mm_maskz_fmsub_ss (0, e, f, g); }
void f22 (void) { h = _mm_mask_fmsub_round_ss (e, 0, f, g, 9); }
void f23 (void) { h = _mm_mask3_fmsub_round_ss (e, f, g, 0, 10); }
void f24 (void) { h = _mm_maskz_fmsub_round_ss (0, e, f, g, 11); }
void f25 (void) { d = _mm_mask_fnmadd_sd (a, 0, b, c); }
void f26 (void) { d = _mm_mask3_fnmadd_sd (a, b, c, 0); }
void f27 (void) { d = _mm_maskz_fnmadd_sd (0, a, b, c); }
void f28 (void) { d = _mm_mask_fnmadd_round_sd (a, 0, b, c, 9); }
void f29 (void) { d = _mm_mask3_fnmadd_round_sd (a, b, c, 0, 10); }
void f30 (void) { d = _mm_maskz_fnmadd_round_sd (0, a, b, c, 11); }
void f31 (void) { h = _mm_mask_fnmadd_ss (e, 0, f, g); }
void f32 (void) { h = _mm_mask3_fnmadd_ss (e, f, g, 0); }
void f33 (void) { h = _mm_maskz_fnmadd_ss (0, e, f, g); }
void f34 (void) { h = _mm_mask_fnmadd_round_ss (e, 0, f, g, 9); }
void f35 (void) { h = _mm_mask3_fnmadd_round_ss (e, f, g, 0, 10); }
void f36 (void) { h = _mm_maskz_fnmadd_round_ss (0, e, f, g, 11); }
void f37 (void) { d = _mm_mask_fnmsub_sd (a, 0, b, c); }
void f38 (void) { d = _mm_mask3_fnmsub_sd (a, b, c, 0); }
void f39 (void) { d = _mm_maskz_fnmsub_sd (0, a, b, c); }
void f40 (void) { d = _mm_mask_fnmsub_round_sd (a, 0, b, c, 9); }
void f41 (void) { d = _mm_mask3_fnmsub_round_sd (a, b, c, 0, 10); }
void f42 (void) { d = _mm_maskz_fnmsub_round_sd (0, a, b, c, 11); }
void f43 (void) { h = _mm_mask_fnmsub_ss (e, 0, f, g); }
void f44 (void) { h = _mm_mask3_fnmsub_ss (e, f, g, 0); }
void f45 (void) { h = _mm_maskz_fnmsub_ss (0, e, f, g); }
void f46 (void) { h = _mm_mask_fnmsub_round_ss (e, 0, f, g, 9); }
void f47 (void) { h = _mm_mask3_fnmsub_round_ss (e, f, g, 0, 10); }
void f48 (void) { h = _mm_maskz_fnmsub_round_ss (0, e, f, g, 11); }
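And roughly what a compile-time test for the constant -1 mask case could look
like (again just a sketch; the exact scan-assembler patterns would need to be
checked against the actual output before anything goes into the testsuite):

/* { dg-do compile } */
/* { dg-options "-O2 -mavx512f" } */

#include <x86intrin.h>

__m128d
foo (__m128d a, __m128d b, __m128d c)
{
  return _mm_mask_fmadd_sd (a, -1, b, c);
}

/* With a constant all-ones mask the masking should be optimized away,
   so a plain vfmadd without a mask register should be emitted.  */
/* { dg-final { scan-assembler "vfmadd" } } */
/* { dg-final { scan-assembler-not "%k\[0-7\]" } } */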