[PATCH #2] Introduce smul_highpart and umul_highpart RTX for high-part multiplications

Roger Sayle Wed, 29 Sep 2021 08:25:38 -0700

Hi Richard,

All excellent suggestions.  The revised patch below implements all of
your (and Andreas') recommendations.  I'm happy to restrict GCC's support
for saturating arithmetic to integer types.  I do know of one target
(nvptx) that supports saturating floating-point math, where results are
clamped to [0.0, 1.0], but I've not investigated how NaNs or signed
zeros are handled there.


Good catch on my min/max typo.  It convinced me to work harder to come
up with some test cases for these simplifications, which I've managed to
trigger on x86_64-pc-linux-gnu in the four new attached test cases.
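
For reference, the constants the new tests scan for can be cross-checked
against a scalar model of the four byte operations.  This is only a
sketch: the helper names are made up for illustration (they are not GCC
internals or intrinsics), and it models a single lane rather than the
full 8-byte vector.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturating add: compute in a wider type, then clamp.  */
static int8_t sat_add_s8(int8_t a, int8_t b)
{
  int wide = (int)a + (int)b;
  if (wide > INT8_MAX) return INT8_MAX;
  if (wide < INT8_MIN) return INT8_MIN;
  return (int8_t)wide;
}

/* Signed saturating subtract, same clamping as the add.  */
static int8_t sat_sub_s8(int8_t a, int8_t b)
{
  int wide = (int)a - (int)b;
  if (wide > INT8_MAX) return INT8_MAX;
  if (wide < INT8_MIN) return INT8_MIN;
  return (int8_t)wide;
}

/* Unsigned saturating add: overflow clamps to the maximum.  */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
  unsigned wide = (unsigned)a + (unsigned)b;
  return wide > UINT8_MAX ? UINT8_MAX : (uint8_t)wide;
}

/* Unsigned saturating subtract: underflow clamps to zero.  */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
  return a > b ? (uint8_t)(a - b) : 0;
}

/* Lane 0 inputs taken from the four attached test cases.  */
static void check_expected_constants(void)
{
  assert(sat_add_s8(1, 2) == 3);
  assert(sat_add_s8(100, 100) == 127);
  assert(sat_add_s8(-100, -100) == -128);
  assert(sat_add_u8(1, 2) == 3);
  assert(sat_add_u8(200, 200) == 255);
  assert(sat_sub_s8(5, 2) == 3);
  assert(sat_sub_s8(-100, 100) == -128);
  assert(sat_sub_s8(100, -100) == 127);
  assert(sat_sub_u8(5, 2) == 3);
  assert(sat_sub_u8(100, 200) == 0);
}
```

The 255 from the unsigned add is what shows up as $-1 in the paddusb
test, since the function returns a (signed) char on x86_64; the 0 from
the unsigned subtract is what the psubusb test expects to see as an xorl.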

This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
and "make -k check" with no new failures.  Ok for mainline?

2021-09-29  Roger Sayle  <ro...@nextmovesoftware.com>
            Richard Sandiford  <richard.sandif...@arm.com>

gcc/ChangeLog
        * gcc/rtl.def (SMUL_HIGHPART, UMUL_HIGHPART): New RTX codes for
        representing signed and unsigned high-part multiplication resp.
        * gcc/simplify-rtx.c (simplify_binary_operation_1) [SMUL_HIGHPART,
        UMUL_HIGHPART]: Simplify high-part multiplications by zero.
        [SS_PLUS, US_PLUS, SS_MINUS, US_MINUS, SS_MULT, US_MULT,
        SS_DIV, US_DIV]: Similar simplifications for saturating
        arithmetic.
        (simplify_const_binary_operation) [SS_PLUS, US_PLUS, SS_MINUS,
        US_MINUS, SS_MULT, US_MULT, SMUL_HIGHPART, UMUL_HIGHPART]:
        Implement compile-time evaluation for constant operands.

        * gcc/dwarf2out.c (mem_loc_descriptor): Skip SMUL_HIGHPART and
        UMUL_HIGHPART.
        * doc/rtl.texi (smul_highpart, umul_highpart): Document RTX codes.
        * doc/md.texi (smul@var{m}3_highpart, umul@var{m}3_highpart):
        Mention the new smul_highpart and umul_highpart RTX codes.
        * doc/invoke.texi: Silence @xref "compilation" warnings.

gcc/testsuite/ChangeLog
        * gcc.target/i386/sse2-mmx-paddsb-2.c: New test case.
        * gcc.target/i386/sse2-mmx-paddusb-2.c: New test case.
        * gcc.target/i386/sse2-mmx-subsb-2.c: New test case.
        * gcc.target/i386/sse2-mmx-subusb-2.c: New test case.

Roger
--

-----Original Message-----
From: Richard Sandiford <richard.sandif...@arm.com> 
Sent: 27 September 2021 16:44
To: Roger Sayle <ro...@nextmovesoftware.com>
Cc: 'GCC Patches' <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] Introduce sh_mul and uh_mul RTX codes for high-part multiplications

"Roger Sayle" <ro...@nextmovesoftware.com> writes:
> This patch introduces new RTX codes to allow the RTL passes and 
> backends to consistently represent high-part multiplications.
> Currently, the RTL used by different backends for expanding 
> smul<mode>3_highpart and umul<mode>3_highpart varies greatly, with 
> many but not all choosing to express this something like:
>
> (define_insn "smuldi3_highpart"
>   [(set (match_operand:DI 0 "nvptx_register_operand" "=R")
>        (truncate:DI
>         (lshiftrt:TI
>          (mult:TI (sign_extend:TI
>                    (match_operand:DI 1 "nvptx_register_operand" "R"))
>                   (sign_extend:TI
>                    (match_operand:DI 2 "nvptx_register_operand" "R")))
>          (const_int 64))))]
>   ""
>   "%.\\tmul.hi.s64\\t%0, %1, %2;")
>
> One complication with using this "widening multiplication" 
> representation is that it requires an intermediate in a wider mode, 
> making it difficult or impossible to encode a high-part multiplication 
> of the widest supported integer mode.

Yeah.  It's also a problem when representing vector ops.
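
To make the semantics concrete, a high-part multiplication can be
modelled in C with a double-width intermediate, which is exactly the
piece that is unavailable for the widest integer mode.  A minimal sketch
for 32-bit operands follows; the function names are hypothetical, and the
signed version assumes an arithmetic right shift of negative values, as
GCC provides.

```c
#include <assert.h>
#include <stdint.h>

/* High half of a signed 32x32->64 product, mirroring
   (truncate (lshiftrt (mult (sign_extend x) (sign_extend y)) 32)).
   Assumes >> on a negative int64_t is an arithmetic shift.  */
static int32_t smul_highpart32(int32_t x, int32_t y)
{
  int64_t wide = (int64_t)x * (int64_t)y;
  return (int32_t)(wide >> 32);
}

/* Unsigned variant: zero-extend instead of sign-extend.  */
static uint32_t umul_highpart32(uint32_t x, uint32_t y)
{
  uint64_t wide = (uint64_t)x * (uint64_t)y;
  return (uint32_t)(wide >> 32);
}

static void check_highpart(void)
{
  assert(umul_highpart32(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFFFFFEu);
  assert(umul_highpart32(0x80000000u, 2u) == 1u);
  assert(smul_highpart32(-1, -1) == 0);
  assert(smul_highpart32(INT32_MIN, 2) == -1);

  /* Multiplying by zero always gives a zero high part.  */
  assert(umul_highpart32(12345u, 0u) == 0u);
  assert(smul_highpart32(-12345, 0) == 0);

  /* Multiplying by one does NOT give x back: the high part is just
     the zero fill (unsigned) or sign fill (signed) of x.  */
  assert(umul_highpart32(12345u, 1u) == 0u);
  assert(smul_highpart32(-12345, 1) == -1);
}
```

The last two checks suggest why only a multiply-by-zero simplification
makes sense for the high-part codes: multiplying by one yields the zero
or sign fill, not x.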

> A second is that it can interfere with optimization; for example 
> simplify-rtx.c contains the comment:
>
>    case TRUNCATE:
>       /* Don't optimize (lshiftrt (mult ...)) as it would interfere
>          with the umulXi3_highpart patterns.  */
>
> Hopefully these problems are solved (or reduced) by introducing a new 
> canonical form for high-part multiplications in RTL passes.
> This also simplifies insn patterns when one operand is constant.
>
> Whilst implementing some constant folding simplifications and 
> compile-time evaluation of these new RTX codes, I noticed that this 
> functionality could also be added for the existing saturating 
> arithmetic RTX codes.  Then likewise when documenting these new RTX 
> codes, I also took the opportunity to silence the @xref warnings in 
> invoke.texi.
>
> This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> and "make -k check" with no new failures.  Ok for mainline?
>
>
> 2021-09-25  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>       * gcc/rtl.def (SH_MULT, UH_MULT): New RTX codes for representing
>       signed and unsigned high-part multiplication respectively.
>       * gcc/simplify-rtx.c (simplify_binary_operation_1) [SH_MULT,
>       UH_MULT]: Simplify high-part multiplications by zero.
>       [SS_PLUS, US_PLUS, SS_MINUS, US_MINUS, SS_MULT, US_MULT,
>       SS_DIV, US_DIV]: Similar simplifications for saturating
>       arithmetic.
>       (simplify_const_binary_operation) [SS_PLUS, US_PLUS, SS_MINUS,
>       US_MINUS, SS_MULT, US_MULT, SH_MULT, UH_MULT]: Implement
>       compile-time evaluation for constant operands.
>       * gcc/dwarf2out.c (mem_loc_descriptor): Skip SH_MULT and UH_MULT.
>       * doc/rtl.texi (sh_mult, uhmult): Document new RTX codes.
>       * doc/md.texi (smul@var{m}3_highpart, umul@var{m3}_highpart):
>       Mention the new sh_mul and uh_mul RTX codes.
>       * doc/invoke.texi: Silence @xref "compilation" warnings.

Looks like a good idea to me.  Only real comment is on the naming:
if possible, I think we should try to avoid introducing yet more differences 
between optab names and rtl codes.  How about umul_highpart for the unsigned 
code, to match both the optab and the existing convention of adding “u” 
directly to the front of non-saturating operations?

Things are more inconsistent for signed rtx codes: sometimes the “s” is present 
and sometimes it isn't.  But since “smin” and “smax”
have it, I think we can justify having it here too.

So I think we should use smul_highpart and umul_highpart.
It's a bit more wordy than sh_mul, but still a lot shorter than the status quo ;-)

> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
> index ebad5cb..b4b04b9 100644
> --- a/gcc/simplify-rtx.c
> +++ b/gcc/simplify-rtx.c
> @@ -4142,11 +4142,40 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>      case US_PLUS:
>      case SS_MINUS:
>      case US_MINUS:
> +      /* Simplify x + 0 to x, if possible.  */

Nit: +/-

> +      if (trueop1 == CONST0_RTX (mode) && !HONOR_SIGNED_ZEROS (mode))

The HONOR_SIGNED_ZEROS check is redundant, since these ops don't support modes 
with signed zero.

Same for the other HONOR_* macros in the patch.  E.g. I don't think we should 
try to guess how infinities and saturation work together.

> +     return op0;
> +      return 0;
> +
>      case SS_MULT:
>      case US_MULT:
> +      /* Simplify x * 0 to 0, if possible.  */
> +      if (trueop1 == CONST0_RTX (mode)
> +       && !HONOR_NANS (mode)
> +       && !HONOR_SIGNED_ZEROS (mode)
> +       && !side_effects_p (op0))
> +     return op1;
> +
> +      /* Simplify x * 1 to x, if possible.  */
> +      if (trueop1 == CONST1_RTX (mode) && !HONOR_SNANS (mode))
> +     return op0;
> +      return 0;
> +
> +    case SH_MULT:
> +    case UH_MULT:
> +      /* Simplify x * 0 to 0, if possible.  */
> +      if (trueop1 == CONST0_RTX (mode)
> +       && !HONOR_NANS (mode)
> +       && !HONOR_SIGNED_ZEROS (mode)
> +       && !side_effects_p (op0))
> +     return op1;
> +      return 0;
> +
>      case SS_DIV:
>      case US_DIV:
> -      /* ??? There are simplifications that can be done.  */
> +      /* Simplify x / 1 to x, if possible.  */
> +      if (trueop1 == CONST1_RTX (mode) && !HONOR_SNANS (mode))
> +     return op0;
>        return 0;
>  
>      case VEC_SERIES:
> @@ -5011,6 +5040,63 @@ simplify_const_binary_operation (enum rtx_code code, machine_mode mode,
>             }
>           break;
>         }
> +
> +     case SS_PLUS:
> +       result = wi::add (pop0, pop1, SIGNED, &overflow);

I think a goto label would be good here, so that later signed ops can reuse 
this code instead of having to repeat it.
Same idea for the unsigned case.

> +          if (overflow == wi::OVF_OVERFLOW)
> +         result = wi::max_value (GET_MODE_PRECISION (int_mode), SIGNED);
> +       else if (overflow == wi::OVF_UNDERFLOW)
> +         result = wi::max_value (GET_MODE_PRECISION (int_mode), SIGNED);

Should be min_value.  Same for the other underflow handlers.
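
A small C model of the two points above (a shared clamp label, and
clamping to the minimum on underflow).  This is an illustrative sketch
on 8-bit values that uses a plain int as the wide type; the enum and
function names are invented here and it does not use the wide-int API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for SS_PLUS, SS_MINUS and SS_MULT.  */
enum sat_op { SAT_PLUS, SAT_MINUS, SAT_MULT };

/* Fold a signed saturating 8-bit operation on constant operands.  */
static int8_t fold_signed_sat8(enum sat_op op, int8_t a, int8_t b)
{
  int wide;

  switch (op)
    {
    case SAT_PLUS:
      wide = a + b;
    clamp_signed:
      /* Shared by every signed operation below.  */
      if (wide > INT8_MAX)
        return INT8_MAX;        /* overflow: clamp to the maximum */
      if (wide < INT8_MIN)
        return INT8_MIN;        /* underflow: clamp to the MINIMUM */
      return (int8_t)wide;

    case SAT_MINUS:
      wide = a - b;
      goto clamp_signed;

    case SAT_MULT:
      wide = a * b;
      goto clamp_signed;
    }
  return 0;
}

static void check_fold(void)
{
  assert(fold_signed_sat8(SAT_PLUS, -100, -100) == INT8_MIN);
  assert(fold_signed_sat8(SAT_MINUS, 100, -100) == INT8_MAX);
  assert(fold_signed_sat8(SAT_MULT, -16, 16) == INT8_MIN);
  assert(fold_signed_sat8(SAT_MULT, 3, 4) == 12);
}
```

With max_value in the underflow branch, the first and third checks would
instead see 127, which is the symptom of the typo.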

Like Andreas said, @pxref would be better where applicable.

Thanks,
Richard

> +          else if (overflow != wi::OVF_NONE)
> +         return NULL_RTX;
> +       break;
> +
> +     case US_PLUS:
> +       result = wi::add (pop0, pop1, UNSIGNED, &overflow);
> +          if (overflow != wi::OVF_NONE)
> +         result = wi::max_value (GET_MODE_PRECISION (int_mode), UNSIGNED);
> +       break;
> +
> +     case SS_MINUS:
> +       result = wi::sub (pop0, pop1, SIGNED, &overflow);
> +          if (overflow == wi::OVF_OVERFLOW)
> +         result = wi::max_value (GET_MODE_PRECISION (int_mode), SIGNED);
> +       else if (overflow == wi::OVF_UNDERFLOW)
> +         result = wi::max_value (GET_MODE_PRECISION (int_mode), SIGNED);
> +          else if (overflow != wi::OVF_NONE)
> +         return NULL_RTX;
> +       break;
> +
> +     case US_MINUS:
> +       result = wi::sub (pop0, pop1, UNSIGNED, &overflow);
> +          if (overflow != wi::OVF_NONE)
> +         result = wi::min_value (GET_MODE_PRECISION (int_mode), UNSIGNED);
> +       break;
> +
> +     case SS_MULT:
> +       result = wi::mul (pop0, pop1, SIGNED, &overflow);
> +          if (overflow == wi::OVF_OVERFLOW)
> +         result = wi::max_value (GET_MODE_PRECISION (int_mode), SIGNED);
> +       else if (overflow == wi::OVF_UNDERFLOW)
> +         result = wi::max_value (GET_MODE_PRECISION (int_mode), SIGNED);
> +          else if (overflow != wi::OVF_NONE)
> +         return NULL_RTX;
> +       break;
> +
> +     case US_MULT:
> +       result = wi::mul (pop0, pop1, UNSIGNED, &overflow);
> +          if (overflow != wi::OVF_NONE)
> +         result = wi::max_value (GET_MODE_PRECISION (int_mode), UNSIGNED);
> +       break;
> +
> +     case SH_MULT:
> +       result = wi::mul_high (pop0, pop1, SIGNED);
> +       break;
> +
> +     case UH_MULT:
> +       result = wi::mul_high (pop0, pop1, UNSIGNED);
> +       break;
> +
>       default:
>         return NULL_RTX;
>       }

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4acb941..7ed0c69 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -3125,7 +3125,7 @@ errors if these functions are not inlined everywhere they are called.
 @itemx -fno-modules-ts
 @opindex fmodules-ts
 @opindex fno-modules-ts
-Enable support for C++20 modules (@xref{C++ Modules}).  The
+Enable support for C++20 modules (@pxref{C++ Modules}).  The
 @option{-fno-modules-ts} is usually not needed, as that is the
 default.  Even though this is a C++20 feature, it is not currently
 implicitly enabled by selecting that standard version.
@@ -33553,7 +33553,7 @@ version selected, although in pre-C++20 versions, it is of course an
 extension.
 
 No new source file suffixes are required or supported.  If you wish to
-use a non-standard suffix (@xref{Overall Options}), you also need
+use a non-standard suffix (@pxref{Overall Options}), you also need
 to provide a @option{-x c++} option too.@footnote{Some users like to
 distinguish module interface files with a new suffix, such as naming
 the source @code{module.cppm}, which involves
@@ -33615,8 +33615,8 @@ to be resolved at the end of compilation.  Without this, imported
 macros are only resolved when expanded or (re)defined.  This option
 detects conflicting import definitions for all macros.
 
-@xref{C++ Module Mapper} for details of the @option{-fmodule-mapper}
-family of options.
+For details of the @option{-fmodule-mapper} family of options,
+@pxref{C++ Module Mapper}.
 
 @menu
 * C++ Module Mapper::       Module Mapper
@@ -33833,8 +33833,8 @@ dialect used and imports of the module.@footnote{The precise contents
 of this output may change.} The timestamp is the same value as that
 provided by the @code{__DATE__} & @code{__TIME__} macros, and may be
 explicitly specified with the environment variable
-@code{SOURCE_DATE_EPOCH}.  @xref{Environment Variables} for further
-details.
+@code{SOURCE_DATE_EPOCH}.  For further details
+@pxref{Environment Variables}.
 
 A set of related CMIs may be copied, provided the relative pathnames
 are preserved.
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2b41cb7..ed35b8f 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5776,11 +5776,13 @@ multiplication.
 @item @samp{smul@var{m}3_highpart}
 Perform a signed multiplication of operands 1 and 2, which have mode
 @var{m}, and store the most significant half of the product in operand 0.
-The least significant half of the product is discarded.
+The least significant half of the product is discarded.  This may be
+represented in RTL using a @code{smul_highpart} RTX expression.
 
 @cindex @code{umul@var{m}3_highpart} instruction pattern
 @item @samp{umul@var{m}3_highpart}
-Similar, but the multiplication is unsigned.
+Similar, but the multiplication is unsigned.  This may be represented
+in RTL using a @code{umul_highpart} RTX expression.
 
 @cindex @code{madd@var{m}@var{n}4} instruction pattern
 @item @samp{madd@var{m}@var{n}4}
diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index e1e76a9..2058997 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -2524,7 +2524,19 @@ not be the same.
 For unsigned widening multiplication, use the same idiom, but with
 @code{zero_extend} instead of @code{sign_extend}.
 
+@findex smul_highpart
+@findex umul_highpart
+@cindex high-part multiplication
+@cindex multiplication high part
+@item (smul_highpart:@var{m} @var{x} @var{y})
+@itemx (umul_highpart:@var{m} @var{x} @var{y})
+Represents the high-part multiplication of @var{x} and @var{y} carried
+out in machine mode @var{m}.  @code{smul_highpart} returns the high part
+of a signed multiplication, @code{umul_highpart} returns the high part
+of an unsigned multiplication.
+
 @findex fma
+@cindex fused multiply-add
 @item (fma:@var{m} @var{x} @var{y} @var{z})
 Represents the @code{fma}, @code{fmaf}, and @code{fmal} builtin
 functions, which compute @samp{@var{x} * @var{y} + @var{z}}
diff --git a/gcc/dwarf2out.c b/gcc/dwarf2out.c
index 9876750..20f2c5d 100644
--- a/gcc/dwarf2out.c
+++ b/gcc/dwarf2out.c
@@ -16809,6 +16809,8 @@ mem_loc_descriptor (rtx rtl, machine_mode mode,
     case CONST_FIXED:
     case CLRSB:
     case CLOBBER:
+    case SMUL_HIGHPART:
+    case UMUL_HIGHPART:
       break;
 
     case CONST_STRING:
diff --git a/gcc/rtl.def b/gcc/rtl.def
index c80144b..5710a2e 100644
--- a/gcc/rtl.def
+++ b/gcc/rtl.def
@@ -467,6 +467,11 @@ DEF_RTL_EXPR(SS_MULT, "ss_mult", "ee", RTX_COMM_ARITH)
 /* Multiplication with unsigned saturation */
 DEF_RTL_EXPR(US_MULT, "us_mult", "ee", RTX_COMM_ARITH)
 
+/* Signed high-part multiplication.  */
+DEF_RTL_EXPR(SMUL_HIGHPART, "smul_highpart", "ee", RTX_COMM_ARITH)
+/* Unsigned high-part multiplication.  */
+DEF_RTL_EXPR(UMUL_HIGHPART, "umul_highpart", "ee", RTX_COMM_ARITH)
+
 /* Operand 0 divided by operand 1.  */
 DEF_RTL_EXPR(DIV, "div", "ee", RTX_BIN_ARITH)
 /* Division with signed saturation */
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index ebad5cb..7e8e2c3 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -4142,11 +4142,36 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
     case US_PLUS:
     case SS_MINUS:
     case US_MINUS:
+      /* Simplify x +/- 0 to x, if possible.  */
+      if (trueop1 == CONST0_RTX (mode))
+       return op0;
+      return 0;
+
     case SS_MULT:
     case US_MULT:
+      /* Simplify x * 0 to 0, if possible.  */
+      if (trueop1 == CONST0_RTX (mode)
+         && !side_effects_p (op0))
+       return op1;
+
+      /* Simplify x * 1 to x, if possible.  */
+      if (trueop1 == CONST1_RTX (mode))
+       return op0;
+      return 0;
+
+    case SMUL_HIGHPART:
+    case UMUL_HIGHPART:
+      /* Simplify x * 0 to 0, if possible.  */
+      if (trueop1 == CONST0_RTX (mode)
+         && !side_effects_p (op0))
+       return op1;
+      return 0;
+
     case SS_DIV:
     case US_DIV:
-      /* ??? There are simplifications that can be done.  */
+      /* Simplify x / 1 to x, if possible.  */
+      if (trueop1 == CONST1_RTX (mode))
+       return op0;
       return 0;
 
     case VEC_SERIES:
@@ -5011,6 +5036,51 @@ simplify_const_binary_operation (enum rtx_code code, machine_mode mode,
              }
            break;
          }
+
+       case SS_PLUS:
+         result = wi::add (pop0, pop1, SIGNED, &overflow);
+ clamp_signed_saturation:
+         if (overflow == wi::OVF_OVERFLOW)
+           result = wi::max_value (GET_MODE_PRECISION (int_mode), SIGNED);
+         else if (overflow == wi::OVF_UNDERFLOW)
+           result = wi::min_value (GET_MODE_PRECISION (int_mode), SIGNED);
+         else if (overflow != wi::OVF_NONE)
+           return NULL_RTX;
+         break;
+
+       case US_PLUS:
+         result = wi::add (pop0, pop1, UNSIGNED, &overflow);
+ clamp_unsigned_saturation:
+         if (overflow != wi::OVF_NONE)
+           result = wi::max_value (GET_MODE_PRECISION (int_mode), UNSIGNED);
+         break;
+
+       case SS_MINUS:
+         result = wi::sub (pop0, pop1, SIGNED, &overflow);
+         goto clamp_signed_saturation;
+
+       case US_MINUS:
+         result = wi::sub (pop0, pop1, UNSIGNED, &overflow);
+         if (overflow != wi::OVF_NONE)
+           result = wi::min_value (GET_MODE_PRECISION (int_mode), UNSIGNED);
+         break;
+
+       case SS_MULT:
+         result = wi::mul (pop0, pop1, SIGNED, &overflow);
+         goto clamp_signed_saturation;
+
+       case US_MULT:
+         result = wi::mul (pop0, pop1, UNSIGNED, &overflow);
+         goto clamp_unsigned_saturation;
+
+       case SMUL_HIGHPART:
+         result = wi::mul_high (pop0, pop1, SIGNED);
+         break;
+
+       case UMUL_HIGHPART:
+         result = wi::mul_high (pop0, pop1, UNSIGNED);
+         break;
+
        default:
          return NULL_RTX;
        }

/* { dg-do compile } */
/* { dg-options "-O2" } */

typedef char v8qi __attribute__ ((vector_size (8)));

char foo()
{
  v8qi tx = { 1, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_paddsb(tx, ty);
  return t[0];
}

char bar()
{
  v8qi tx = { 100, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { 100, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_paddsb(tx, ty);
  return t[0];
}

char baz()
{
  v8qi tx = { -100, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { -100, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_paddsb(tx, ty);
  return t[0];
}

/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$3," 1 } } */
/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$127," 1 } } */
/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$-128," 1 } } */
/* { dg-final { scan-assembler-not "paddsb\[ \\t\]+%xmm\[0-9\]+" } } */

/* { dg-do compile } */
/* { dg-options "-O2" } */

typedef char v8qi __attribute__ ((vector_size (8)));

char foo()
{
  v8qi tx = { 1, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_paddusb(tx, ty);
  return t[0];
}

char bar()
{
  v8qi tx = { 200, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { 200, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_paddusb(tx, ty);
  return t[0];
}

/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$3," 1 } } */
/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$-1," 1 } } */
/* { dg-final { scan-assembler-not "paddusb\[ \\t\]+%xmm\[0-9\]+" } } */

/* { dg-do compile } */
/* { dg-options "-O2" } */

typedef char v8qi __attribute__ ((vector_size (8)));

char foo()
{
  v8qi tx = { 5, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_psubsb(tx, ty);
  return t[0];
}

char bar()
{
  v8qi tx = { -100, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { 100, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_psubsb(tx, ty);
  return t[0];
}

char baz()
{
  v8qi tx = { 100, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { -100, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_psubsb(tx, ty);
  return t[0];
}

/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$3," 1 } } */
/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$-128," 1 } } */
/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$127," 1 } } */
/* { dg-final { scan-assembler-not "psubsb\[ \\t\]+%xmm\[0-9\]+" } } */

/* { dg-do compile } */
/* { dg-options "-O2" } */

typedef char v8qi __attribute__ ((vector_size (8)));

char foo()
{
  v8qi tx = { 5, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_psubusb(tx, ty);
  return t[0];
}

char bar()
{
  v8qi tx = { 100, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { 200, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_psubusb(tx, ty);
  return t[0];
}

/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$3," 1 } } */
/* { dg-final { scan-assembler-times "xorl\[ \\t\]+" 1 } } */
/* { dg-final { scan-assembler-not "psubusb\[ \\t\]+%xmm\[0-9\]+" } } */
