On 15/07/2020 03:39, 夏 晋 via Gcc wrote:
Hi everyone,
   I'm trying to autovectorize a loop and, thanks to the omnipotent 
macros, everything has gone well so far. But now that I need to optimize the 
loop further, I have run into some problems.
   Our vector instructions can process 16 numbers at the same time, so if the 
for loop counter is equal to or larger than 16, the loop will be autovectorized. 
For example:
   for (int i = 0; i < 16; i++) c[i] = a[i] + b[i];
   will become:
   vld v0, a0
   vld v1, a1
   vadd v0,v0,v1
   vfst v0, a2
   But if I write code like: for (int i = 0; i < 15; i++) c[i] = a[i] + b[i]; the 
autovectorizer will miss it. We do have an instruction, "vlen", which can change 
the length of the vector operation, and I would like to generate assembly like this when the 
loop counter is 15:
   vlen 15
   vld v0, a0
   vld v1, a1
   vadd v0,v0,v1
   vfst v0, a2
   What should I do to achieve this goal? I've tried to define "TARGET_HAVE_DOLOOP_BEGIN" and a define_expand 
"doloop_begin", but "doloop_begin" is never called. Is there any other way? And if the loop 
counter is bigger than 16, like 30 or 31, or just a variable, what should I do with "vlen"? Any hint would be 
helpful. Thank you very much.


We have had similar issues with the AMD GCN port, in which the vector length is 64 and many smaller vectorizable cases get missed.

There are two solutions (that I know of):

1. Implement "masked" vectors. GCC will then use just a portion of the total vector in some cases. I don't know if your architecture can cope with arbitrary masks, but you can probably simulate them using vector conditionals, and still win (maybe). You can certainly recognise constant masks that merely change the length. Probably the vectorizer code could be modified, via a new hook, to only generate masks that work for you (masks generated via WHILE_ULT would be fine, for example).

2. Add extra, smaller vector modes that work the same way, and have your backend insert vlen adjustments as necessary (in the md_reorg pass, perhaps). You might have V2, V4, V8, and V16, for example.
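
To illustrate why WHILE_ULT-style masks map well onto a "vlen" instruction, here is a rough scalar model in C. It is only a sketch: the function names (while_ult_mask, mask_to_vlen, masked_add) are made up for this example and are not GCC internals, and the 16-lane width is taken from the original question. The point is that a mask of this form is always a contiguous prefix of lanes, so it can be replaced by a single length value, and the same loop then covers trip counts of 15, 30, 31, or a variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VF 16  /* lanes per vector, as in the original question */

/* Scalar model of a WHILE_ULT-style mask: lane k is active iff
   base + k < n.  Bit k of the result corresponds to lane k. */
uint16_t while_ult_mask(size_t base, size_t n)
{
    uint16_t mask = 0;
    for (size_t k = 0; k < VF; k++)
        if (base + k < n)
            mask |= (uint16_t)(1u << k);
    return mask;
}

/* A mask of that form is a contiguous prefix of lanes, so it is fully
   described by its active lane count, which is what "vlen" would take. */
unsigned mask_to_vlen(uint16_t mask)
{
    unsigned len = 0;
    while (mask & 1u) {
        len++;
        mask >>= 1;
    }
    assert(mask == 0);  /* anything left over means it was not a prefix mask */
    return len;
}

/* c[i] = a[i] + b[i] for i < n, one masked "vector" step at a time. */
void masked_add(int *c, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i += VF) {
        uint16_t mask = while_ult_mask(i, n);
        for (size_t k = 0; k < VF; k++)   /* stands in for one vector op */
            if (mask & (1u << k))
                c[i + k] = a[i + k] + b[i + k];
    }
}
```

With n = 30, for instance, the first step gets a full mask (length 16) and the second step gets a prefix mask of length 14, so the backend would only need a vlen change before the final step.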

Or both: for GCN, arbitrary masks work fine, but not all of GCC can take advantage of them, so I've been experimenting with adding multiple vector length modes to make up the difference.
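
For the multiple-mode approach, the "insert vlen adjustments as necessary" step could look roughly like the following C sketch. This is a simplified model, not real md_reorg code: struct vec_insn and place_vlen are hypothetical names, the instruction stream is assumed to be straight-line (a real pass would also have to handle control flow), and the hardware is assumed to start out at the full 16-lane length. It only shows the basic idea of tracking the current length and emitting a vlen change when the next instruction's mode needs a different one.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 16

/* Hypothetical stand-in for one vector instruction: only the lane count
   implied by its mode (V2, V4, V8 or V16) matters here. */
struct vec_insn {
    unsigned lanes;
};

/* Walk a straight-line instruction stream and mark where a "vlen"
   adjustment is needed: before insn i iff its lane count differs from
   the length currently in effect (assumed to start at MAX_LANES).
   Sets need_vlen[i] to 1 at those positions and returns how many
   adjustments were marked. */
size_t place_vlen(const struct vec_insn *insns, size_t count, int *need_vlen)
{
    unsigned current = MAX_LANES;
    size_t emitted = 0;
    for (size_t i = 0; i < count; i++) {
        assert(insns[i].lanes >= 1 && insns[i].lanes <= MAX_LANES);
        need_vlen[i] = (insns[i].lanes != current);
        if (need_vlen[i]) {
            current = insns[i].lanes;
            emitted++;
        }
    }
    return emitted;
}
```

Consecutive instructions in the same mode share one adjustment, so a run of V8 operations costs a single vlen change on entry and another when the code returns to V16.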

Andrew
