On 15/07/2020 03:39, 夏 晋 via Gcc wrote:
Hi everyone,
I'm trying to autovectorize a loop and, thanks to the omnipotent macros,
everything has gone well so far. But now that I need to optimize the loop
further, I have run into some problems.
Our vector instructions can process 16 numbers at a time, so a for loop is
autovectorized if its trip count is 16 or more.
For example:
for (int i = 0; i < 16; i++) c[i] = a[i] + b[i];
becomes:
vld v0, a0
vld v1, a1
vadd v0,v0,v1
vfst v0, a2
But if I write code like: for (int i = 0; i < 15; i++) c[i] = a[i] + b[i]; the
autovectorizer misses it. We do have an instruction, "vlen", which changes
the length of the vector operations, and I would like to generate assembly like
this when the trip count is 15:
vlen 15
vld v0, a0
vld v1, a1
vadd v0,v0,v1
vfst v0, a2
What should I do to achieve this? I've tried defining TARGET_HAVE_DOLOOP_BEGIN and a define_expand
"doloop_begin", but "doloop_begin" is never called. Is there any other way? And if the trip
count is larger than 16, such as 30 or 31, or is just a variable, what should I do with "vlen"? Any hint would be
helpful. Thank you very much.
We have had similar issues with the AMD GCN port, in which the vector
length is 64 and many smaller vectorizable cases get missed.
There are two solutions (that I know of):
1. Implement "masked" vectors. GCC will then use just a portion of the
total vector in some cases. I don't know if your architecture can cope
with arbitrary masks, but you can probably simulate them using vector
conditionals, and still win (maybe). You can certainly recognise
constant masks that merely change the length. Probably the vectorizer
code could be modified, via a new hook, to only generate masks that work
for you (masks generated via WHILE_ULT would be fine, for example).
2. Add extra, smaller vector modes that work the same way, and have your
backend insert vlen adjustments as necessary (in the md_reorg pass, perhaps).
You might have V2, V4, V8, and V16, for example.
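
To make option 2 a little more concrete, here is a minimal sketch in plain C
of the kind of bookkeeping such a pass would do. It is only a model, not GCC
code: "struct vop" and "insert_vlen" are hypothetical names, and it assumes
that the vlen setting persists until it is changed, that the hardware starts
at the full 16 lanes, and that the instructions are straight-line (no control
flow).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 16   /* full vector length, as in the question */

/* Hypothetical stand-in for one vector instruction: a mnemonic plus the
   number of lanes implied by its vector mode (e.g. 8 for a V8 mode). */
struct vop {
    const char *name;
    int lanes;
};

/* Walk the instructions in order and decide where a "vlen" is needed.
   need_vlen[i] receives the length to set just before ops[i], or 0 if the
   length already in effect is correct.  Returns the number of vlen
   adjustments.  Assumes the length in effect on entry is MAX_LANES. */
static size_t insert_vlen(const struct vop *ops, size_t n, int *need_vlen)
{
    int current = MAX_LANES;
    size_t inserted = 0;

    for (size_t i = 0; i < n; i++) {
        assert(ops[i].lanes >= 1 && ops[i].lanes <= MAX_LANES);
        if (ops[i].lanes != current) {
            need_vlen[i] = ops[i].lanes;   /* length changes here */
            current = ops[i].lanes;
            inserted++;
        } else {
            need_vlen[i] = 0;              /* nothing to emit */
        }
    }
    return inserted;
}
```

For a sequence of four 8-lane instructions followed by two 16-lane ones, this
marks a "vlen 8" before the first instruction and a "vlen 16" before the
fifth, and nothing elsewhere. A real pass would also have to handle branches,
loops and calls.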
Or both: for GCN, arbitrary masks work fine, but not all of GCC can take
advantage of them, so I've been experimenting with adding multiple
vector length modes to make up the difference.
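
As for why WHILE_ULT-style masks would be fine in option 1, here is a small
scalar model in C, a sketch under stated assumptions rather than anything
from a real port. The function names are made up for illustration, and the
16-lane width is taken from the question. The point is that a mask with lane
k active whenever base + k < n is always a prefix of the lanes, so it can be
turned into a single vlen value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16   /* lanes per vector, as in the question */

/* Model of a WHILE_ULT-style mask: bit k is set iff base + k < n. */
static uint32_t while_ult_mask(size_t base, size_t n)
{
    uint32_t mask = 0;
    for (unsigned k = 0; k < LANES; k++)
        if (base + k < n)
            mask |= (uint32_t)1 << k;
    return mask;
}

/* If the mask is a prefix (bits 0..len-1 set, the rest clear), return len,
   which is the value a "vlen" would need.  Otherwise return -1. */
static int mask_to_vlen(uint32_t mask)
{
    if (mask & (mask + 1))      /* nonzero unless mask is 2^len - 1 */
        return -1;
    int len = 0;
    while (mask) {
        len++;
        mask >>= 1;
    }
    return len;
}

/* Scalar model of one masked vector add starting at element base. */
static void masked_add(int *c, const int *a, const int *b,
                       size_t base, uint32_t mask)
{
    for (unsigned k = 0; k < LANES; k++)
        if (mask & ((uint32_t)1 << k))
            c[base + k] = a[base + k] + b[base + k];
}

/* The whole loop: one masked vector operation per 16 elements. */
static void vec_add(int *c, const int *a, const int *b, size_t n)
{
    for (size_t base = 0; base < n; base += LANES) {
        uint32_t mask = while_ult_mask(base, n);
        assert(mask_to_vlen(mask) > 0);   /* always a non-empty prefix */
        masked_add(c, a, b, base, mask);
    }
}
```

With n = 15 the only mask maps to a length of 15. With n = 31 the first
iteration gets 16 and the second gets 15, and a variable n works the same
way, with the length computed at run time.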
Andrew