Re: How to refine autovectorized loop

Andrew Stubbs Wed, 15 Jul 2020 05:45:21 -0700

On 15/07/2020 03:39, 夏 晋 via Gcc wrote:

Hi everyone,
   I'm trying to autovectorize a loop and, thanks to the omnipotent 
macros, everything has gone well so far. But now I need to optimize the 
loop further, and I have run into some problems.
   Our vector instructions can process 16 numbers at the same time, so if the 
loop trip count is equal to or larger than 16, the loop is autovectorized. 
For example:
   for (int i = 0; i < 16; i++) c[i] = a[i] + b[i];
   compiles to:
   vld v0, a0
   vld v1, a1
   vadd v0,v0,v1
   vfst v0, a2
   But if I write code like: for (int i = 0; i < 15; i++) c[i] = a[i] + b[i]; the 
autovectorizer misses it. We do have an instruction, "vlen", which changes 
the length of the vector operations, and I would like to generate assembly like this when the 
trip count is 15:
   vlen 15
   vld v0, a0
   vld v1, a1
   vadd v0,v0,v1
   vfst v0, a2
   What should I do to achieve this? I've tried to "define TARGET_HAVE_DOLOOP_BEGIN" and to add a define_expand 
for "doloop_begin", but "doloop_begin" is never called. Is there any other way? And if the trip 
count is larger than 16, such as 30 or 31, or is just a variable, what should I do with "vlen"? Any hint would be 
helpful. Thank you very much.

We have had similar issues with the AMD GCN port, in which the vector length is 64 and many smaller vectorizable cases get missed.


There are two solutions (that I know of):

1. Implement "masked" vectors. GCC will then use just a portion of the total vector in some cases. I don't know if your architecture can cope with arbitrary masks, but you can probably simulate them using vector conditionals, and still win (maybe). You can certainly recognise constant masks that merely change the length. Probably the vectorizer code could be modified, via a new hook, to only generate masks that work for you (masks generated via WHILE_ULT would be fine, for example).

2. Add extra, smaller vector modes that work the same, but your backend inserts vlen adjustments as necessary (in the md_reorg pass, perhaps). You might have V2, V4, V8, and V16, for example.

Or both: for GCN, arbitrary masks work fine, but not all of GCC can take advantage of them, so I've been experimenting with adding multiple vector length modes to make up the difference.
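
To make the two options above concrete, here is a rough scalar model in C of what the generated code would compute for a trip count that is not a multiple of 16. It is only an illustration, not GCC output or backend code: VF, while_ult(), add_masked() and add_vlen() are names made up for this sketch, and the inner loops over k stand in for single vector instructions.

```c
#include <assert.h>

#define VF 16  /* assumed hardware vector length, as in the question */

/* Scalar model of a WHILE_ULT-style mask: lane k is active while i + k < n. */
static void while_ult(int i, int n, unsigned char mask[VF])
{
    for (int k = 0; k < VF; k++)
        mask[k] = (unsigned char)(i + k < n);
}

/* Option 1: every iteration is a full-width operation under a mask.
   Inactive lanes neither load nor store. */
static void add_masked(int *c, const int *a, const int *b, int n)
{
    unsigned char mask[VF];

    assert(n >= 0);
    for (int i = 0; i < n; i += VF) {
        while_ult(i, n, mask);
        for (int k = 0; k < VF; k++)   /* stands in for one masked vector add */
            if (mask[k])
                c[i + k] = a[i + k] + b[i + k];
    }
}

/* Option 2 flavour: the active length is set per iteration, which is
   where a backend would emit "vlen len" ahead of the vector instructions. */
static void add_vlen(int *c, const int *a, const int *b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i += VF) {
        int len = (n - i < VF) ? n - i : VF;  /* would become "vlen len" */
        for (int k = 0; k < len; k++)         /* stands in for one vector add */
            c[i + k] = a[i + k] + b[i + k];
    }
}
```

With n = 31, both functions do one full 16-lane step followed by a 15-lane step, and neither touches element 31. A mask produced this way is always a contiguous prefix of lanes, which is why it can be lowered to a plain length change on hardware that has no general masking.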


Andrew
