https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115186

            Bug ID: 115186
           Summary: Suboptimal codes generated by rtl-expand for divmod
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dizhao at os dot amperecomputing.com
  Target Milestone: ---

For the code below:

        typedef unsigned short int uint16_t;
        typedef unsigned char uint8_t;
        typedef long unsigned int size_t;

        uint16_t fletcher16(const uint8_t *data, const size_t len)
        {
                register uint16_t sum1 = 0, sum2 = 0;
                register size_t i;

                for (i = 0; i < len; i++) {
                        sum1 = (sum1 + data[i]) % 255;
                        sum2 = (sum2 + sum1) % 255;
                }
                return ((uint16_t)(sum2 << 8)) | sum1;
        }

, when compiled with "-O3 -mcpu=ampere1a", the division by 255 is expanded into
a shift+add sequence instead of a div.
The dump file at 266r.expand shows:

   20: NOTE_INSN_BASIC_BLOCK 4
   21: debug i => r114:DI-r116:DI
   22: debug sum2 => r111:SI#0
   23: debug sum1 => r110:SI#0
   24: debug begin stmt marker
   25: r118:SI=zero_extend([r114:DI])
   26: r119:SI=r118:SI+r110:SI
   27: r120:DI=zero_extend(r119:SI)
   28: r121:DI=r120:DI
   29: r122:DI=r121:DI<<0x8
   30: r123:DI=r122:DI+r120:DI
      REG_EQUAL r120:DI*0x101
   31: r124:DI=r123:DI<<0x10
   32: r125:DI=r123:DI+r124:DI
      REG_EQUAL r120:DI*0x1010101
   33: r126:DI=r125:DI<<0x7
   34: r127:DI=r126:DI+r120:DI
      REG_EQUAL r120:DI*0x80808081
   35: r128:DI=r127:DI 0>>0x20
   36: r104:SI=r128:DI#0 0>>0x7
      REG_EQUAL udiv(r119:SI,0xff)
   ...

However, using a mult instruction is better, as in the following (this result
can be produced with "-mtune=neoverse-n1"):

   20: NOTE_INSN_BASIC_BLOCK 4
   21: debug i => r114:DI-r116:DI
   22: debug sum2 => r111:SI#0
   23: debug sum1 => r110:SI#0
   24: debug begin stmt marker
   25: r118:SI=zero_extend([r114:DI])
   26: r119:SI=r118:SI+r110:SI
   27: r121:SI=0xffffffff80808081
   28: r120:DI=zero_extend(r119:SI)*zero_extend(r121:SI)
   29: r122:DI=r120:DI 0>>0x20
   30: r104:SI=r122:DI#0 0>>0x7
      REG_EQUAL udiv(r119:SI,0xff)
   ...
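
To make the two expansions easier to compare, here is a minimal C sketch of
what each dump computes for udiv(x, 0xff). The function names are hypothetical
and chosen only for illustration; the constants and shift amounts are taken
from the REG_EQUAL notes above. Both forms compute the high 32 bits of
x * 0x80808081 and then shift right by 7.

```c
#include <assert.h>
#include <stdint.h>

/* Shift/add form, following the sequence in the -mcpu=ampere1a dump.
   (Hypothetical helper name; not part of GCC.) */
static uint32_t div255_shift_add(uint32_t x)
{
    uint64_t t = x;              /* zero_extend to DI            */
    uint64_t a = (t << 8) + t;   /* x * 0x101                    */
    uint64_t b = a + (a << 16);  /* x * 0x1010101                */
    uint64_t c = (b << 7) + t;   /* x * 0x80808081               */
    return (uint32_t)(c >> 32) >> 7;  /* high part, then >> 7    */
}

/* Widening-multiply form, following the -mtune=neoverse-n1 dump.
   (Hypothetical helper name; not part of GCC.) */
static uint32_t div255_mul(uint32_t x)
{
    uint64_t p = (uint64_t)x * 0x80808081u;  /* 32x32 -> 64 multiply */
    return (uint32_t)(p >> 32) >> 7;
}

/* Spot-check both forms against a plain division on a few inputs. */
static void div255_selfcheck(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 254u, 255u, 256u, 509u, 510u, 65535u,
        0x7fffffffu, 0xffffffffu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t x = samples[i];
        assert(div255_shift_add(x) == x / 255u);
        assert(div255_mul(x) == x / 255u);
    }
}
```

Calling div255_selfcheck() exercises both helpers. The two forms are
arithmetically identical, so the only difference between the expansions is the
instruction sequence and its estimated cost on the selected CPU.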

The problem is, in expmed.cc:expmed_mult_highpart, when the result here is 0:
      /* See whether the specialized multiplication optabs are
         cheaper than the shift/add version.  */
      tem = expmed_mult_highpart_optab (mode, op0, narrow_op1, target,
                                        unsignedp,

, the "tem" produced by the code that follows (shift/add version) can be more
expensive than the result of:
      return expmed_mult_highpart_optab (mode, op0, op1, target, unsignedp,
                                     max_cost);

For -mcpu=ampere1a, the estimated cost of the former is 36, while the cost of
the latter is 28.