https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115186
Bug ID: 115186
Summary: Suboptimal codes generated by rtl-expand for divmod
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: dizhao at os dot amperecomputing.com
Target Milestone: ---

For the code below:

    typedef unsigned short int uint16_t;
    typedef unsigned char uint8_t;
    typedef long unsigned int size_t;

    uint16_t fletcher16(const uint8_t *data, const size_t len)
    {
      register uint16_t sum1 = 0, sum2 = 0;
      register size_t i;

      for (i = 0; i < len; i++)
        {
          sum1 = (sum1 + data[i]) % 255;
          sum2 = (sum2 + sum1) % 255;
        }

      return ((uint16_t)(sum2 << 8)) | sum1;
    }

When compiled with "-O3 -mcpu=ampere1a", shift+add is used instead of div.
The dump file at 266r.expand shows:

    20: NOTE_INSN_BASIC_BLOCK 4
    21: debug i => r114:DI-r116:DI
    22: debug sum2 => r111:SI#0
    23: debug sum1 => r110:SI#0
    24: debug begin stmt marker
    25: r118:SI=zero_extend([r114:DI])
    26: r119:SI=r118:SI+r110:SI
    27: r120:DI=zero_extend(r119:SI)
    28: r121:DI=r120:DI
    29: r122:DI=r121:DI<<0x8
    30: r123:DI=r122:DI+r120:DI
          REG_EQUAL r120:DI*0x101
    31: r124:DI=r123:DI<<0x10
    32: r125:DI=r123:DI+r124:DI
          REG_EQUAL r120:DI*0x1010101
    33: r126:DI=r125:DI<<0x7
    34: r127:DI=r126:DI+r120:DI
          REG_EQUAL r120:DI*0x80808081
    35: r128:DI=r127:DI 0>>0x20
    36: r104:SI=r128:DI#0 0>>0x7
          REG_EQUAL udiv(r119:SI,0xff)
    ...

However, using a mult instruction is better, like this (the result can be
produced with "-mtune=neoverse-n1"):

    20: NOTE_INSN_BASIC_BLOCK 4
    21: debug i => r114:DI-r116:DI
    22: debug sum2 => r111:SI#0
    23: debug sum1 => r110:SI#0
    24: debug begin stmt marker
    25: r118:SI=zero_extend([r114:DI])
    26: r119:SI=r118:SI+r110:SI
    27: r121:SI=0xffffffff80808081
    28: r120:DI=zero_extend(r119:SI)*zero_extend(r121:SI)
    29: r122:DI=r120:DI 0>>0x20
    30: r104:SI=r122:DI#0 0>>0x7
          REG_EQUAL udiv(r119:SI,0xff)
    ...
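To make the two expansions easier to compare, here is a minimal C sketch of what each dump computes for the unsigned division by 255. The function names (div255_shift_add, div255_mul, div255_self_check) are made up for illustration and do not appear in GCC; the constants and shift counts are taken from the REG_EQUAL notes in the dumps. It only shows that both sequences yield the same quotient, and says nothing about their cost on any particular core.

```c
#include <assert.h>
#include <stdint.h>

/* Shift/add form, following insns 27-36 of the first dump
   (-mcpu=ampere1a).  Hypothetical helper, not GCC code.  */
uint32_t div255_shift_add(uint32_t x)
{
    uint64_t t = x;              /* r120 = zero_extend(r119)          */
    uint64_t a = (t << 8) + t;   /* r123: REG_EQUAL r120 * 0x101      */
    uint64_t b = a + (a << 16);  /* r125: REG_EQUAL r120 * 0x1010101  */
    uint64_t c = (b << 7) + t;   /* r127: REG_EQUAL r120 * 0x80808081 */
    return (uint32_t)(c >> 32) >> 7;   /* r128, then r104 */
}

/* Widening-multiply form, following insns 27-30 of the second dump
   (-mtune=neoverse-n1).  The dump prints the SImode constant as
   0xffffffff80808081; after zero_extend its value is 0x80808081.  */
uint32_t div255_mul(uint32_t x)
{
    uint64_t p = (uint64_t)x * UINT64_C(0x80808081);  /* r120 */
    return (uint32_t)(p >> 32) >> 7;                  /* r122, then r104 */
}

/* Spot-check both forms against plain division at a few edge values.  */
void div255_self_check(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 254u, 255u, 256u, 509u, 510u, 65535u, 0xFFFFFFFFu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t x = samples[i];
        assert(div255_shift_add(x) == x / 255u);
        assert(div255_mul(x) == x / 255u);
    }
}
```

Both helpers form the same 64-bit product x * 0x80808081 and then shift right by 39 bits in total (32, then 7), so the only difference between the two dumps is how that product is built: three shift+add pairs in the first, one widening multiply in the second.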
The problem is in expmed.cc:expmed_mult_highpart. When the result of this
call is 0:

    /* See whether the specialized multiplication optabs are
       cheaper than the shift/add version.  */
    tem = expmed_mult_highpart_optab (mode, op0, narrow_op1, target, unsignedp,

the "tem" produced by the code that follows (the shift/add version) can be
more expensive than the result of:

    return expmed_mult_highpart_optab (mode, op0, op1, target,
                                       unsignedp, max_cost);

For -mcpu=ampere1a, the estimated cost of the former is 36, while the cost of
the latter is 28.