> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Monday, February 27, 2023 9:33 PM
> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> Cc: Tamar Christina <tamar.christ...@arm.com>; nd <n...@arm.com>;
> rguent...@suse.de; j...@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
>
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Monday, February 27, 2023 12:12 PM
> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > Hi,
> >> >
> >> >> > I avoided open coding it with add and shift because it creates a
> >> >> > 4-instruction dependency chain (with shifts, which are typically
> >> >> > slow) instead of a load and multiply.  This change, unless the
> >> >> > target is known to optimize it further, is unlikely to be
> >> >> > beneficial.  And by the time we get to costing, the only
> >> >> > alternative is to undo the existing pattern, and so you lose the
> >> >> > general shift optimization.
> >> >> >
> >> >> > So it seemed unwise to open code it as shifts, given the codegen
> >> >> > out of the vectorizer would be degenerate for most targets, or
> >> >> > one needs the more complicated route of costing during pattern
> >> >> > matching already.
> >> >>
> >> >> Hmm, OK.  That seems like a cost-model thing though, rather than
> >> >> something that should be exposed through optabs.  And I imagine
> >> >> the open-coded version would still be better than nothing on
> >> >> targets without highpart multiply.
> >> >>
> >> >> So how about replacing the hook with one that simply asks whether
> >> >> division through highpart multiplication is preferred over the
> >> >> add/shift sequence?  (Unfortunately it's not going to be possible
> >> >> to work that out from existing information.)
> >> >
> >> > So this doesn't work for SVE.  For SVE the multiplication widening
> >> > pass introduces FMAs at gimple level.  So in the cases where the
> >> > operation is fed from a widening multiplication we end up
> >> > generating FMA.  If that was it I could have matched FMA.
> >> >
> >> > But it also pushes the multiplication into the second operand because
> >> > it no longer has a mul to share the results with.
> >> >
> >> > In any case, the gimple code is transformed into
> >> >
> >> >   vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> >> >   vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
> >> >   vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257, ... });
> >> >   vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> >> >   vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124, vect_patt_65.12_128);
> >> >   vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> >> >   vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21) vect_patt_62.14_130;
> >> >
> >> > This transformation is much worse than the original code: it
> >> > extends the dependency chain with another expensive instruction.
> >> > I can try to correct this in RTL by matching FMA and shift and
> >> > splitting into MUL + ADDHNB and hope CSE takes care of the extra mul.
> >> >
> >> > But this seems like a hack, and it's basically undoing the earlier
> >> > transformation.  It seems to me that the open coding is a bad idea.
> >>
> >> Could you post the patch that gives this result?  I'll have a poke around.
> >
> > Sure, I'll post the new series, it needs all of them.
>
> Thanks.  Which testcase did you use to get the above?
>
#include <stdint.h>

#define N 16
#define TYPE uint8_t

void fun3(TYPE* restrict pixel, TYPE level, int n)
{
  for (int i = 0; i < (n & -16); i += 1)
    pixel[i] = (pixel[i] * level) / 0xff;
}

> But since SVE does have highpart multiply, and since the assumption for SVE
> is that MULH+shift is better than ADD*3+shift*2, shouldn't SVE just be one
> of the targets for which the hook that "asks whether division through
> highpart multiplication is preferred over the add/shift sequence" returns
> true?
>

Yes (it's also two adds, not three), but it's not correct for SVE2, which has
ADDHNB; there 2x ADDHNB is much faster than MULH+shift.

And the problem is that widening_mul will not let the add+shift sequence
reach the backend intact: because the ADD and shift were open coded, they
are now subjected to further optimization.

To summarize:

Other targets: false
SVE: false
SVE2: true
NEON: true

SVE2 is the broken case because MUL+ADD+SHIFT gets rewritten into FMA+SHIFT.

If you're saying you don't want the optimization for SVE2, then sure, I'm
happy to turn it off.  But UMULH+LSR is 6 cycles on Neoverse-N2 with a
throughput of 1, while 2x ADDHNB is 4 cycles with a throughput of 2.

Tamar.

>
> Richard
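
For reference, the two lowerings being weighed above reduce to the following
scalar identities for the 16-bit product p = pixel[i] * level.  This is only
an illustrative sketch, not code from the patch: the helper names are made
up, and the 0x8081 multiplier with a final shift of 7 is the standard magic
constant for unsigned division by 255, assumed here rather than taken from
the thread; the 257 bias is the constant visible in the gimple dump above.

#include <assert.h>
#include <stdint.h>

/* Highpart-multiply lowering (the UMULH + LSR path): take the top 16 bits
   of p * 0x8081, then shift right by 7.  */
static uint16_t div255_umulh (uint16_t p)
{
  uint16_t hi = (uint16_t) (((uint32_t) p * 0x8081u) >> 16);  /* UMULH */
  return hi >> 7;                                             /* LSR #7 */
}

/* Open-coded add/shift lowering (two adds, two shifts), the form that
   widening_mul later rewrites into the FMA + shift chain quoted above.
   Each "add, then keep the narrowed top half" step is what a single SVE2
   ADDHNB can provide, which is why 2x ADDHNB covers the whole division.
   Exact for every product of two 8-bit values (checked below).  */
static uint16_t div255_addshift (uint16_t p)
{
  uint16_t t = (uint16_t) ((p + 257u) >> 8);   /* add bias 257, narrow high */
  return (uint16_t) ((p + t) >> 8);            /* add, narrow high again */
}

int main (void)
{
  /* Exhaustively check both lowerings over the full product range.  */
  for (uint32_t x = 0; x < 256; x++)
    for (uint32_t level = 0; level < 256; level++)
      {
        uint16_t p = (uint16_t) (x * level);
        assert (div255_umulh (p) == p / 0xff);
        assert (div255_addshift (p) == p / 0xff);
      }
  return 0;
}

In scalar form the add/shift variant is the longer dependency chain, which
is the trade-off discussed in the thread: it only wins where each add plus
narrowing shift maps to a cheap narrowing add such as SVE2's ADDHNB.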