This series improves the generation of nontemporal store instructions. Currently, the aprefetch pass cannot guarantee memory alignment for vector types, which prevents the expand pass from generating MOVNT* instructions. This series addresses that limitation.
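To illustrate why alignment matters here, the following is a minimal hand-written sketch, not the code this series generates. The helper name fill_nontemporal is hypothetical. It uses the SSE2 intrinsic _mm_stream_si128, which maps to MOVNTDQ and requires a 16-byte-aligned destination in its 128-bit form; on targets without SSE2 the sketch falls back to ordinary stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not part of the patch): fill dst[0..n) with
   VALUE using nontemporal stores where SSE2 is available.  DST must be
   16-byte aligned and N a multiple of 4, mirroring the alignment
   guarantee the compiler needs before it can emit MOVNTDQ.  */
static void
fill_nontemporal (int32_t *dst, size_t n, int32_t value)
{
  assert (((uintptr_t) dst & 15) == 0);
  assert (n % 4 == 0);
#if defined(__SSE2__)
  __m128i v = _mm_set1_epi32 (value);
  for (size_t i = 0; i < n; i += 4)
    _mm_stream_si128 ((__m128i *) (dst + i), v);	/* MOVNTDQ */
  /* Order the nontemporal stores before any later stores.  */
  _mm_sfence ();
#else
  /* Portable fallback: ordinary stores.  */
  for (size_t i = 0; i < n; i++)
    dst[i] = value;
#endif
}
```

If the destination were not known to be 16-byte aligned, the streaming store above would be invalid, which is the situation the expand pass currently has to assume.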
Patch 1 introduces -fnon-temporal-store and the analysis logic.
Patch 2 extends the i386 backend to generate MOVNTDQ for
V4SI/V8SI/V16SI/V8HI/V16HI/V32HI/V16QI/V32QI/V64QI.

Enabling the -fnon-temporal-store optimization results in an average
memory bandwidth improvement of approximately 40% in the Scale, Add,
and Triad sub-tests of the STREAM benchmark
(https://www.cs.virginia.edu/stream/).

Bootstrapped and regtested on x86_64-linux-gnu with no regressions.

[email protected] (2):
  feat: Add -fnon-temporal-store option to enhance nontemporal store
    optimization
  i386: Extend MOVNTDQ support to cover various packed integer types

 gcc/common.opt                |   4 +
 gcc/config/i386/sse.md        |  13 ++-
 gcc/passes.def                |   3 +-
 gcc/tree-ssa-loop-niter.cc    | 151 ++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-niter.h     |   1 +
 gcc/tree-ssa-loop-prefetch.cc | 109 ++++++++++++++++++++----
 gcc/tree-vect-data-refs.cc    |   6 ++
 7 files changed, 267 insertions(+), 20 deletions(-)

-- 
2.22.0
