[PATCH 0/2] Enhance non-temporal store optimization

xuexiaomei Mon, 22 Dec 2025 21:45:15 -0800

This series optimizes the generation of non-temporal store instructions.

Currently, the aprefetch pass cannot guarantee memory alignment for
vectorized types, which prevents the expand pass from generating
MOVNT* instructions. This series removes that limitation.
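To show why alignment matters at the instruction level, here is a hand-written illustration. It is not the code the patch generates. The SSE2 intrinsic _mm_stream_si128 compiles to MOVNTDQ (VMOVNTDQ when AVX is enabled), and MOVNTDQ faults if its 16-byte destination is not 16-byte aligned. A compiler can therefore emit it only when the alignment is known. The function name copy_v4si_nt is hypothetical, and the sketch assumes x86-64 and an element count that is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128 -> MOVNTDQ */

/* Hypothetical helper: copy n int32_t elements in V4SI (4 x 32-bit)
   chunks using non-temporal stores. Assumes dst is 16-byte aligned
   and n is a multiple of 4. */
void copy_v4si_nt(int32_t *dst, const int32_t *src, size_t n)
{
    /* MOVNTDQ raises #GP on a misaligned 16-byte destination. */
    assert(((uintptr_t)dst & 15u) == 0);
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        /* Source alignment is not required, so use an unaligned load. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), v);  /* MOVNTDQ */
    }

    /* Non-temporal stores are weakly ordered; SFENCE makes them
       globally visible before any later stores. */
    _mm_sfence();
}
```

With 16-byte-aligned buffers (for example, arrays declared with _Alignas(16)), the loop issues one MOVNTDQ per four int32_t elements.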


Patch 1 introduces the -fnon-temporal-store option and the analysis logic.
Patch 2 extends the i386 backend to generate MOVNTDQ for the V4SI, V8SI,
V16SI, V8HI, V16HI, V32HI, V16QI, V32QI and V64QI modes.

Enabling -fnon-temporal-store improves memory bandwidth by approximately
40% on average in the Scale, Add, and Triad sub-tests of the STREAM
benchmark.
(https://www.cs.virginia.edu/stream/)
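For reference, the Scale, Add, and Triad kernels are simple streaming loops. The C rendering below assumes STREAM's default element type, double. The function names are hypothetical wrappers, and the loop bodies follow the kernel definitions in STREAM's stream.c. Each loop writes an array that the kernel never reads back. When the arrays are much larger than the last-level cache, non-temporal stores to that array avoid the read-for-ownership traffic that an ordinary store miss incurs.

```c
#include <stddef.h>

/* Hypothetical wrappers around the STREAM kernels (element type
   double, STREAM's default). Each loop writes one output array that
   it does not reread: a candidate for non-temporal stores. */

/* Scale: b[j] = scalar * c[j] */
void stream_scale(double *restrict b, const double *restrict c,
                  double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        b[j] = scalar * c[j];
}

/* Add: c[j] = a[j] + b[j] */
void stream_add(double *restrict c, const double *restrict a,
                const double *restrict b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

/* Triad: a[j] = b[j] + scalar * c[j] */
void stream_triad(double *restrict a, const double *restrict b,
                  const double *restrict c, double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

STREAM runs these loops back to back over arrays sized well beyond the caches. That is why the store traffic, rather than the arithmetic, dominates the measured bandwidth.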

Bootstrapped and regtested on x86_64-linux-gnu with no regressions.

[email protected] (2):
  feat: Add -fnon-temporal-store option to enhance nontemporal store
    optimization
  i386: Extend MOVNTDQ support to cover various packed integer types

 gcc/common.opt                |   4 +
 gcc/config/i386/sse.md        |  13 ++-
 gcc/passes.def                |   3 +-
 gcc/tree-ssa-loop-niter.cc    | 151 ++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-niter.h     |   1 +
 gcc/tree-ssa-loop-prefetch.cc | 109 ++++++++++++++++++++----
 gcc/tree-vect-data-refs.cc    |   6 ++
 7 files changed, 267 insertions(+), 20 deletions(-)

-- 
2.22.0
