https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109326
--- Comment #6 from Steve Thompson <susurrus.of.qualia at gmail dot com> ---
(In reply to Steve Thompson from comment #5)
> 1 8 16 32
> 64B code:
>
> 1.2K code:

Sorry, my touchpad glitched and sent the comment prematurely.

For the overlarge vectorized version I hate:

[28] nr_ops=1 nr_samples=1000000(0) min=1 avg=5 max=12248
[28] nr_ops=8 nr_samples=1000000(0) min=1 avg=6 max=13022
[28] nr_ops=16 nr_samples=1000000(0) min=8 avg=11 max=9548
[28] nr_ops=32 nr_samples=1000000(0) min=26 avg=33 max=8126
[28] nr_ops=64 nr_samples=1000000(0) min=62 avg=73 max=11186
[28] nr_ops=128 nr_samples=1000000(0) min=134 avg=153 max=14426
[28] nr_ops=256 nr_samples=1000000(0) min=296 avg=312 max=12608
[28] nr_ops=1024 nr_samples=1000000(0) min=1250 avg=1269 max=23858

And the compact, esthetically pleasing version I like:

[28] nr_ops=1 nr_samples=1000000(0) min=1 avg=5 max=7910
[28] nr_ops=8 nr_samples=1000000(0) min=1 avg=7 max=20150
[28] nr_ops=16 nr_samples=1000000(0) min=8 avg=24 max=11402
[28] nr_ops=32 nr_samples=1000000(0) min=62 avg=74 max=20582
[28] nr_ops=64 nr_samples=1000000(0) min=152 avg=153 max=12482
[28] nr_ops=128 nr_samples=1000000(0) min=296 avg=313 max=33884
[28] nr_ops=256 nr_samples=1000000(0) min=620 avg=632 max=22940
[28] nr_ops=1024 nr_samples=1000000(0) min=2528 avg=2546 max=25064

(The system is an AMD Ryzen 5700U laptop; the [28] is the measured cycle latency of the RDTSCP instruction; the number in parentheses counts bad samples, which occur occasionally.)

As it turns out, the vectorized version has no advantage for arrays smaller than 16; from 16 onward it is approximately twice as fast. Some will be happy to pay that cost in code size for the extra performance, I suppose, but it still seems wasteful.

Again, sorry for being an idiot.
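
To make the "approximately twice as fast" reading easy to check, here is a minimal sketch that divides the avg= figures of the compact build by those of the vectorized build for each nr_ops. The struct, table, and function names are made up for illustration and are not part of the benchmark that produced the numbers; only the avg= values are copied from the two listings above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the posted results: average cycles for each build.
 * (Hypothetical layout, not taken from the original benchmark.) */
struct sample {
    unsigned nr_ops;
    unsigned vectorized_avg;
    unsigned compact_avg;
};

/* avg= values copied from the two listings in the comment. */
static const struct sample samples[] = {
    {    1,    5,    5 },
    {    8,    6,    7 },
    {   16,   11,   24 },
    {   32,   33,   74 },
    {   64,   73,  153 },
    {  128,  153,  313 },
    {  256,  312,  632 },
    { 1024, 1269, 2546 },
};

/* How many times faster the vectorized build is, judged by average cycles. */
double speedup(unsigned compact_avg, unsigned vectorized_avg)
{
    assert(vectorized_avg > 0);
    return (double)compact_avg / (double)vectorized_avg;
}

/* Print one line per array size. */
void print_speedups(FILE *out)
{
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        fprintf(out, "nr_ops=%-4u speedup=%.2f\n", samples[i].nr_ops,
                speedup(samples[i].compact_avg, samples[i].vectorized_avg));
}
```

Calling print_speedups(stdout) from a main() gives ratios of 1.00 and 1.17 for 1 and 8 elements, and between roughly 2.0 and 2.2 from 16 elements upward. This compares averages only; the max= columns are dominated by outliers and are ignored here.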