On 3/16/24 13:28, Jeff Law wrote:
>> Implementation Details (for posterity)
>> --------------------------------------
>> - basic idea is to have a splitter selected via a new predicate for
>> constant
>> being possible sum of two S12 and provide the transform.
>> This is however a 2 -> 2 transform which combine can't handle.
>> So we specify it using a define_insn_and_split.
> Right. For the record it looks like a 2->2 case because of the
> mvconst_internal define_insn_and_split.
>
> What I was talking about in the Tuesday meeting last week was the desire
> to rethink that pattern precisely because it drives us to need to
> implement patterns like yours rather than the more natural 3->2 or 4->3/2.
A few weeks, I did an experimental run removing that splitter isn't
catastrophic for SPEC, granted it is a pretty narrow view of the world :-)
Below is upstream vs. revert of mvconst_internal (for apples:apples just
the revert, none of my new splitter stuff)
Baseline Revert mvconst_int
1,214,172,747,858 | 1,225,245,908,131 | -0.91% <--
740,618,658,160 | 743,873,857,461 | -0.44% <-
692,128,469,436 | 695,695,894,602 | -0.52% <--
190,811,495,310 | 190,934,386,665 | -0.06%
225,751,808,189 | 225,909,747,497 | -0.07%
220,364,237,915 | 220,466,640,640 | -0.05%
179,157,361,261 | 179,357,613,835 | -0.11%
219,313,921,837 | 219,436,712,114 | -0.06%
281,344,210,410 | 281,344,197,305 | 0.00%
446,517,079,816 | 446,517,058,807 | 0.00%
347,300,137,757 | 347,300,119,942 | 0.00%
421,496,082,529 | 421,496,063,144 | 0.00%
669,319,255,911 | 673,977,112,732 | -0.70% <--
2,852,810,623,629 | 2,852,809,986,901 | 0.00%
1,855,884,349,001 | 1,855,837,810,109 | 0.00%
1,653,853,403,482 | 1,653,714,171,270 | 0.01%
2,974,347,287,057 | 2,970,520,074,342 | 0.13%
1,158,337,609,995 | 1,158,337,607,076 | 0.00%
1,019,181,486,091 | 1,020,576,716,760 | -0.14%
1,738,961,517,942 | 1,736,024,569,247 | 0.17%
849,330,280,930 | 848,955,738,855 | 0.04%
276,584,892,794 | 276,457,202,331 | 0.05%
913,410,589,335 | 913,407,101,214 | 0.00%
903,864,384,529 | 903,800,709,452 | 0.01%
1,654,905,086,567 | 1,656,684,048,052 | -0.11%
1,513,514,546,262 | 1,510,815,435,668 | 0.18%
1,641,980,078,831 | 1,602,410,552,874 | 2.41% <--
2,546,170,307,336 | 2,546,206,790,578 | 0.00%
1,983,551,321,388 | 1,979,562,936,994 | 0.20%
1,516,077,036,742 | 1,515,038,668,667 | 0.07%
2,056,386,745,670 | 2,055,592,903,700 | 0.04%
1,008,224,427,448 | 1,008,027,321,669 | 0.02%
1,169,305,141,789 | 1,169,028,937,430 | 0.02%
363,827,037,541 | 361,982,880,800 | 0.51% <--
906,649,110,459 | 909,651,995,900 | -0.33% <-
509,023,896,044 | 508,578,027,604 | 0.09%
403,238,430 | 398,492,922 | 1.18%
403,238,430 | 398,492,922 | 1.18%
38,917,902,479,830 38,886,374,486,212
Do note however that this takes us back to constant pool way of loading
complex constants (vs. bit fiddling and stitching). The 2.4% gain in
deepsjeng is exactly that and we need to decide what to do about it:
keep it as such or tweak it with a cost model change and/or make this
for everyone or have this per cpu tune - hard to tell whats the right
thing to do here.
P.S. Perhaps obvious but the prerequisite to revert is to tend to the
original linked PRs which will now start failing so that will hopefully
improve some of the above.
> Furthermore, I have a suspicion that there's logicals where we're going
> to want to do something similar and if we keep the mvconst_internal
> pattern they're all going to have to be implemented as 2->2s with a
> define_insn_and_split rather than the more natural 3->2.
Naive question: why is define_split more natural vs. define_insn_and_split.
Is it because of the syntax/implementation or that one is more Combine
"centric" and other is more of a Split* Pass thing (roughly speaking of
course) or something else altogether ?
Would we have to revisit the new splitter (and perhaps audit the
existing define_insn_and_split patterns) if we were to go ahead with
this revert ?
>> +(define_insn_and_split "*add<mode>3_const_sum_of_two_s12"
>> + [(set (match_operand:P 0 "register_operand" "=r,r")
>> + (plus:P (match_operand:P 1 "register_operand" " r,r")
>> + (match_operand:P 2 "const_two_s12" " MiG,r")))]
>> + ""
>> + "add %0,%1,%2"
>> + ""
>> + [(set (match_dup 0)
>> + (plus:P (match_dup 1) (match_dup 3)))
>> + (set (match_dup 0)
>> + (plus:P (match_dup 0) (match_dup 4)))]
>> +{
>> + int val = INTVAL (operands[2]);
>> + if (SUM_OF_TWO_S12_P (val))
>> + {
>> + operands[3] = GEN_INT (2047);
>> + operands[4] = GEN_INT (val - 2047);
>> + }
>> + else if (SUM_OF_TWO_S12_N (val))
>> + {
>> + operands[3] = GEN_INT (-2048);
>> + operands[4] = GEN_INT (val + 2048);
>> + }
>> + else
>> + gcc_unreachable ();
>> +}
>> + [(set_attr "type" "arith")
>> + (set_attr "mode" "<P:MODE>")])
> So why use "P" for your mode iterator? I would have expected GPR since
> I think this works for both SI and DI mode as-is.
My intent at least was to have this work for either of rv64/rv32 and in
either of those environments, work for both SI or DI - certainly
rv64+{SI,DI}.
To that end I debated between GPR, X and P.
It seems GPR only supports DI if TARGET_64BIT.
But I could well be wrong about above requirements or the subtle
differences in these.
> For the output template "#" for alternative 0 and the add instruction
> for alternative 1 is probably better. That way it's clear to everyone
> that alternative 0 is always meant to be split.
OK.
> Don't you need some kind of condition on the split to avoid splitting
> when you've got a register for operands[2]?
Won't the predicate "match_code" guarantee that ?
(define_predicate "const_two_s12"
(match_code "const_int")
{
return SUM_OF_TWO_S12 (INTVAL (op), false);
})
> It would seem to me that as
> written, it would still try to spit and trigger an RTL checking error
> when you try to extract INTVAL (operands[2]).
Do I need to build gcc in a certain way to expose such errors - I wasn't
able to trigger such an error for the entire testsuite or SPEC build.
Thx for the detailed quick review.
-Vineet