Tamar Christina <tamar.christ...@arm.com> writes:
> Hi Richard,
>> > [...]
>> > 3) So I abandoned vec-patterns and instead tried to do it in
>> > tree-vect-slp.c in vect_analyze_slp_instance just after the SLP tree
>> > is created.  Matching the SLP tree is quite simple and getting it to
>> > emit the right SLP tree was simple enough, except that at this point
>> > all data references and loads have already been calculated.
>> 
>> (3) seems like the way to go.  Can you explain in more detail why it didn't
>> work?  The SLP tree after matching should look something like this:
>> 
>>   REALPART_EXPR <*_10> = _4;
>>   IMAGPART_EXPR <*_10> = _13;
>> 
>>   _4 = .COMPLEX_ADD_ROT_90 (_12, _8)
>>   _13 = .COMPLEX_ADD_ROT_90 (_22, _6)
>> 
>>   _12 = REALPART_EXPR <*_3>;
>>   _22 = IMAGPART_EXPR <*_3>;
>> 
>>   _8 = REALPART_EXPR <*_5>;
>>   _6 = IMAGPART_EXPR <*_5>;
>> 
>> The operands to the individual .COMPLEX_ADD_ROT_90s aren't the
>> operands that actually determine the associated scalar result, but that's
>> bound to be the case with something that includes an internal permute.  All
>> we're trying to describe is an operation that does the right thing when
>> vectorised.
>> 
>> If you didn't have the .COMPLEX_ADD_ROT_90 and just fell back on mixed
>> two-operator SLP, the final node would be in the opposite order:
>> 
>>   _6 = IMAGPART_EXPR <*_5>;
>>   _8 = REALPART_EXPR <*_5>;
>> 
>> So if you're doing the matching after building the initial tree, you'd need to
>> swap the statements in that node so that _8 comes first and cancel the
>> associated load permute.  If you're doing the matching on the fly while
>> building the SLP tree then the subnodes should start out in the right order.
>
> Ah, I hadn't tried it this way because in the SLP version I had originally
> started by looking at the complex fma, which would have a considerably longer
> match pattern.
>
>   _3 = c_14(D) + _2;
>   _11 = REALPART_EXPR <*_3>;
>   _21 = IMAGPART_EXPR <*_3>;
>   _5 = a_15(D) + _2;
>   _22 = REALPART_EXPR <*_5>;
>   _12 = IMAGPART_EXPR <*_5>;
>   _7 = b_16(D) + _2;
>   _19 = REALPART_EXPR <*_7>;
>   _20 = IMAGPART_EXPR <*_7>;
>   _25 = _19 * _22;
>   _26 = _12 * _20;
>   _27 = _20 * _22;
>   _28 = _12 * _19;
>   _29 = _25 - _26;
>   _30 = _27 + _28;
>   _31 = _11 + _29;
>   _32 = _21 + _30;
>   REALPART_EXPR <*_3> = _31;
>   IMAGPART_EXPR <*_3> = _32;
>
> So in this case I should replace _31 and _32, right?  But I can't remove the
> other statements, otherwise it'll complain later about the missing references.
> I could replace _31 and _32 with something using all the variables I would
> need; however, when I tried this previously in vect-patterns there was a block
> on built-in functions with more than 4 arguments (and currently 3 is the
> limit for built-in functions in the def file as well).  I don't know if that
> same limitation is in place if I replace it in SLP.
>
> The complex fma basically creates this vector
>
>  b⋅c - e⋅f + l 
>  b⋅e + c⋅f + n
>
> so I'd need 5 parameters and then I'm guessing the other expressions would be 
> removed by DCE at some point?

Are you planning to make the FCMLA behaviour directly available
as an internal function or provide a higher-level one that does
a full complex multiply, with the target lowering that into
individual instructions where necessary?

Either way, each individual FCMLA should only need three scalar inputs.
Like with FCADD, it doesn't matter whether the operands to the individual
scalar FCMLAs are the ones (or the only ones) that determine the
associated FCMLA scalar result.  All the node needs to do is describe
something that would work when vectorised.
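
As a rough illustration of the three-input point (a sketch only: the
struct and function names below are hypothetical, not GCC internal
functions, and the rotation semantics are written from my reading of the
Armv8.3-A FCMLA description), each rotation can be modelled per complex
lane pair like this:

```c
/* Hypothetical scalar model of FCMLA on one (real, imag) lane pair.
   The names are illustrative only and do not exist in GCC.  */
typedef struct { double re, im; } cpair;

/* Rotation #0: each lane reads three scalars.
   real lane: acc.re, a.re, b.re
   imag lane: acc.im, a.re, b.im  (note a.re, not a.im)  */
static cpair fcmla_rot0 (cpair acc, cpair a, cpair b)
{
  acc.re += a.re * b.re;
  acc.im += a.re * b.im;
  return acc;
}

/* Rotation #90: again three scalars per lane.
   real lane: acc.re, a.im, b.im
   imag lane: acc.im, a.im, b.re  */
static cpair fcmla_rot90 (cpair acc, cpair a, cpair b)
{
  acc.re -= a.im * b.im;
  acc.im += a.im * b.re;
  return acc;
}

/* Chaining the two rotations gives acc + a * b for complex a, b.  */
static cpair complex_fma (cpair acc, cpair a, cpair b)
{
  return fcmla_rot90 (fcmla_rot0 (acc, a, b), a, b);
}
```

With acc = 5+6i, a = 1+2i and b = 3+4i, complex_fma returns 0+16i,
i.e. acc + a*b.  Each scalar lane only ever sees three values, but the
imaginary lane of rotation #0 uses a.re rather than a.im, which is the
internal permute mentioned above.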

What to do with the intermediate results you don't need is an
interesting question :-).  Like you say, I was hoping DCE would
get rid of them later.  Does that not work?

Thanks,
Richard
