[Bug middle-end/109326] Sub-optimal assembler code generation for valid C on x86-64

susurrus.of.qualia at gmail dot com via Gcc-bugs Wed, 29 Mar 2023 19:32:36 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109326


--- Comment #6 from Steve Thompson <susurrus.of.qualia at gmail dot com> ---
(In reply to Steve Thompson from comment #5)
>           1    8   16   32
> 64B code:
> 
> 1.2K code:

Sorry, my touchpad glitched and sent prematurely.

For the overlarge vectorized version I hate:
[28]  nr_ops=1      nr_samples=1000000(0)   min=1       avg=5       max=12248
[28]  nr_ops=8      nr_samples=1000000(0)   min=1       avg=6       max=13022
[28]  nr_ops=16     nr_samples=1000000(0)   min=8       avg=11      max=9548 
[28]  nr_ops=32     nr_samples=1000000(0)   min=26      avg=33      max=8126 
[28]  nr_ops=64     nr_samples=1000000(0)   min=62      avg=73      max=11186
[28]  nr_ops=128    nr_samples=1000000(0)   min=134     avg=153     max=14426
[28]  nr_ops=256    nr_samples=1000000(0)   min=296     avg=312     max=12608
[28]  nr_ops=1024   nr_samples=1000000(0)   min=1250    avg=1269    max=23858

And the compact, esthetically pleasing version I like:
[28]  nr_ops=1      nr_samples=1000000(0)   min=1       avg=5       max=7910 
[28]  nr_ops=8      nr_samples=1000000(0)   min=1       avg=7       max=20150
[28]  nr_ops=16     nr_samples=1000000(0)   min=8       avg=24      max=11402
[28]  nr_ops=32     nr_samples=1000000(0)   min=62      avg=74      max=20582
[28]  nr_ops=64     nr_samples=1000000(0)   min=152     avg=153     max=12482
[28]  nr_ops=128    nr_samples=1000000(0)   min=296     avg=313     max=33884
[28]  nr_ops=256    nr_samples=1000000(0)   min=620     avg=632     max=22940
[28]  nr_ops=1024   nr_samples=1000000(0)   min=2528    avg=2546    max=25064

(System is an AMD Ryzen 5700U laptop. The [28] is the measured cycle latency of
the RDTSCP instruction, and the parenthesized number is the count of bad samples,
which is occasionally nonzero.)
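The harness that produced these tables isn't included in this comment. Below is a minimal C sketch of one way such numbers could be collected, under these assumptions:

- Each sample is bracketed by two RDTSCP reads.
- The separately measured RDTSCP overhead is subtracted from each sample.
- A sample counts as "bad" when the counter goes backwards (for example, after a core migration).

All names are hypothetical, including `cycle_stats`, `measure`, `rdtscp_overhead` and the stand-in workload `demo_sum`. On non-x86 hosts the sketch falls back to `clock_gettime`, which reports nanoseconds rather than cycles.

```c
#define _POSIX_C_SOURCE 199309L   /* clock_gettime in the non-x86 fallback */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>            /* __rdtscp (GCC/Clang) */
/* RDTSCP waits for earlier instructions to finish before reading the TSC. */
static inline uint64_t read_tsc(void)
{
    unsigned int aux;             /* receives IA32_TSC_AUX; unused here */
    return __rdtscp(&aux);
}
#else
#include <time.h>
/* Assumed fallback for non-x86 hosts: nanoseconds, not cycles. */
static inline uint64_t read_tsc(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical summary, mirroring the min/avg/max columns above. */
struct cycle_stats {
    uint64_t min, max, sum;
    uint64_t nr_good, nr_bad;     /* avg = sum / nr_good */
};

/* Smallest cost of two back-to-back reads: the timer's own overhead. */
static uint64_t rdtscp_overhead(int reps)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        uint64_t t0 = read_tsc();
        uint64_t t1 = read_tsc();
        if (t1 >= t0 && t1 - t0 < best)
            best = t1 - t0;
    }
    return best == UINT64_MAX ? 0 : best;
}

/* Time fn(arg) nr_samples times, subtracting the timer overhead. */
static struct cycle_stats measure(void (*fn)(void *), void *arg,
                                  uint64_t nr_samples, uint64_t overhead)
{
    assert(fn != NULL);
    struct cycle_stats s = { UINT64_MAX, 0, 0, 0, 0 };
    for (uint64_t i = 0; i < nr_samples; i++) {
        uint64_t t0 = read_tsc();
        fn(arg);
        uint64_t t1 = read_tsc();
        if (t1 < t0) {            /* counter went backwards: assumed "bad" */
            s.nr_bad++;
            continue;
        }
        uint64_t d = t1 - t0;
        d = d > overhead ? d - overhead : 0;
        if (d < s.min) s.min = d;
        if (d > s.max) s.max = d;
        s.sum += d;
        s.nr_good++;
    }
    return s;
}

/* Hypothetical stand-in for the loop under test: sum an array of bytes. */
struct demo_buf { const unsigned char *p; size_t n; };
static volatile unsigned demo_sink;   /* keeps the sum from being optimized away */
static void demo_sum(void *arg)
{
    const struct demo_buf *b = arg;
    unsigned acc = 0;
    for (size_t i = 0; i < b->n; i++)
        acc += b->p[i];
    demo_sink = acc;
}
```

On CPUs with an invariant TSC, RDTSCP counts reference ticks at a fixed rate, not core clock cycles. Absolute numbers from a sketch like this are therefore best compared only against each other on the same machine.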


As it turns out, the vectorized version has no advantage for arrays smaller than
16 elements; from 16 elements on, it is approximately twice as fast. Some will be
happy to pay that code-size cost for the extra performance, I suppose, but it
still seems wasteful.
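The "approximately twice as fast" reading can be checked by dividing the two avg columns from the tables above. This is a small illustrative C sketch. The array names and the `speedup`/`print_speedups` helpers are hypothetical, and the numbers are copied verbatim from the avg columns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Average cycle counts copied from the two result tables above. */
static const int nr_ops[]      = {1, 8, 16, 32, 64, 128, 256, 1024};
static const int avg_vector[]  = {5, 6, 11, 33, 73, 153, 312, 1269};
static const int avg_compact[] = {5, 7, 24, 74, 153, 313, 632, 2546};
#define N_ROWS (sizeof nr_ops / sizeof nr_ops[0])

/* Speedup of the vectorized code: compact avg divided by vectorized avg. */
static double speedup(size_t row)
{
    assert(row < N_ROWS);
    return (double)avg_compact[row] / (double)avg_vector[row];
}

/* Print one line per array size. */
static void print_speedups(void)
{
    for (size_t i = 0; i < N_ROWS; i++)
        printf("nr_ops=%-5d speedup=%.2f\n", nr_ops[i], speedup(i));
}
```

Calling `print_speedups()` from `main` gives ratios of 1.00 and about 1.17 for 1 and 8 elements. For 16 elements and up, every ratio falls between roughly 2.0 and 2.25.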

Again, sorry for being an idiot.
