On Mon, Apr 30, 2018 at 4:34 PM, Scott D Phillips <[email protected]> wrote:
> Matt Turner <[email protected]> writes:
>
>> On Mon, Apr 30, 2018 at 10:25 AM, Scott D Phillips
>> <[email protected]> wrote:
>>> The reference for MOVNTDQA says:
>>>
>>>     For WC memory type, the nontemporal hint may be implemented by
>>>     loading a temporary internal buffer with the equivalent of an
>>>     aligned cache line without filling this data to the cache.
>>>     [...] Subsequent MOVNTDQA reads to unread portions of the WC
>>>     cache line will receive data from the temporary internal
>>>     buffer if data is available.
>>>
>>> This hidden cache-line-sized temporary buffer can improve the
>>> read performance from WC maps.
>>>
>>> v2: Add mfence at start of tiled_to_linear for streaming loads (Chris)
>>>
>>> Reviewed-by: Chris Wilson <[email protected]>
>>
>> I think I understand the mechanics of this change. Let me tell you
>> what I think is going on, and you can confirm my understanding.
>>
>> We want to inline a movntdqa-using memcpy function into the tiling
>> functions for performance reasons. Since movntdqa is an SSE4.1
>> instruction, we must split intel_tiled_memcpy.c out into its own
>> convenience library so it can be built with -msse4.1.
>>
>> Because the existing _mesa_streaming_load_memcpy is compiled into its
>> own object file, we can't inline it directly (and we might not want to
>> use it directly anyway, since it calls _mm_mfence(); we can just do
>> that once at the very end of the calling function). So we implement a
>> simple memcpy with movntdqa capable of handling only 16 bytes or
>> 64 bytes.
>>
>> Somewhat oddly, we pass in _mesa_streaming_load_memcpy (which we
>> cannot inline) and then actually call our simpler memcpy. It's safe to
>> protect the SSE4.1-using code with only #if defined(USE_SSE41) because
>> _mesa_streaming_load_memcpy is only passed if cpu_has_sse4_1 is true
>> (in the next patch).
>>
>> Did I miss anything?
>
> You've got it exactly.
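The shape Matt describes can be sketched as a standalone C file. This is a hypothetical rendering of the idea, not Mesa's code: the function names, the runtime dispatch, and the non-x86 fallback are all mine. The point it illustrates is the one from the summary above: an always_inline movntdqa copy with the fence hoisted out of the per-copy path into the caller.

```c
#include <stddef.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (movntdqa) */

/* One cache line (64 bytes) via four movntdqa loads.  always_inline so
 * the call disappears into the tiling loop; src must be 16-byte aligned. */
static inline __attribute__((always_inline, target("sse4.1"))) void
copy_cacheline_streaming(void *dst, void *src)
{
   __m128i v0 = _mm_stream_load_si128(((__m128i *)src) + 0);
   __m128i v1 = _mm_stream_load_si128(((__m128i *)src) + 1);
   __m128i v2 = _mm_stream_load_si128(((__m128i *)src) + 2);
   __m128i v3 = _mm_stream_load_si128(((__m128i *)src) + 3);
   _mm_store_si128(((__m128i *)dst) + 0, v0);
   _mm_store_si128(((__m128i *)dst) + 1, v1);
   _mm_store_si128(((__m128i *)dst) + 2, v2);
   _mm_store_si128(((__m128i *)dst) + 3, v3);
}

/* The fence is issued once here, in the caller, rather than once per
 * memcpy call as _mesa_streaming_load_memcpy would do. */
__attribute__((target("sse4.1"))) static void
copy_lines_sse41(void *dst, void *src, size_t n_cachelines)
{
   _mm_mfence();
   for (size_t i = 0; i < n_cachelines; i++)
      copy_cacheline_streaming((char *)dst + 64 * i, (char *)src + 64 * i);
}
#endif

void
copy_lines_streaming(void *dst, void *src, size_t n_cachelines)
{
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
   if (__builtin_cpu_supports("sse4.1")) {
      copy_lines_sse41(dst, src, n_cachelines);
      return;
   }
#endif
   memcpy(dst, src, 64 * n_cachelines);  /* plain fallback */
}
```

The target("sse4.1") attribute lets this compile without a global -msse4.1 flag; Mesa instead builds the whole file with SSE41_CFLAGS, which is what the convenience library in this patch is for.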
>
>>> ---
>>>  src/mesa/drivers/dri/i965/Makefile.am          |  7 +++
>>>  src/mesa/drivers/dri/i965/Makefile.sources     |  6 ++-
>>>  src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 62 ++++++++++++++++++++++++++
>>>  src/mesa/drivers/dri/i965/meson.build          | 18 ++++++--
>>>  4 files changed, 88 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/Makefile.am b/src/mesa/drivers/dri/i965/Makefile.am
>>> index 889d4c68a2b..ff47add93f4 100644
>>> --- a/src/mesa/drivers/dri/i965/Makefile.am
>>> +++ b/src/mesa/drivers/dri/i965/Makefile.am
>>> @@ -92,8 +92,14 @@ libi965_gen11_la_CFLAGS = $(AM_CFLAGS) -DGEN_VERSIONx10=110
>>>
>>>  noinst_LTLIBRARIES = \
>>>  	libi965_dri.la \
>>> +	libintel_tiled_memcpy.la \
>>>  	$(I965_PERGEN_LIBS)
>>>
>>> +libintel_tiled_memcpy_la_SOURCES = \
>>> +	$(intel_tiled_memcpy_FILES)
>>> +libintel_tiled_memcpy_la_CFLAGS = \
>>> +	$(AM_CFLAGS) $(SSE41_CFLAGS)
>>> +
>>>  libi965_dri_la_SOURCES = \
>>>  	$(i965_FILES) \
>>>  	$(i965_oa_GENERATED_FILES)
>>> @@ -104,6 +110,7 @@ libi965_dri_la_LIBADD = \
>>>  	$(top_builddir)/src/intel/compiler/libintel_compiler.la \
>>>  	$(top_builddir)/src/intel/blorp/libblorp.la \
>>>  	$(I965_PERGEN_LIBS) \
>>> +	libintel_tiled_memcpy.la \
>>>  	$(LIBDRM_LIBS)
>>>
>>>  BUILT_SOURCES = $(i965_oa_GENERATED_FILES)
>>> diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
>>> index 5e53d874d88..a15bba50eec 100644
>>> --- a/src/mesa/drivers/dri/i965/Makefile.sources
>>> +++ b/src/mesa/drivers/dri/i965/Makefile.sources
>>> @@ -112,11 +112,13 @@ i965_FILES = \
>>>  	intel_tex_image.c \
>>>  	intel_tex_obj.h \
>>>  	intel_tex_validate.c \
>>> -	intel_tiled_memcpy.c \
>>> -	intel_tiled_memcpy.h \
>>>  	intel_upload.c \
>>>  	libdrm_macros.h
>>>
>>> +intel_tiled_memcpy_FILES = \
>>> +	intel_tiled_memcpy.c \
>>> +	intel_tiled_memcpy.h
>>> +
>>>  i965_gen4_FILES = \
>>>  	genX_blorp_exec.c \
>>>  	genX_state_upload.c
>>> diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>>> index 7c6bde990d6..fac5427d2ed 100644
>>> --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>>> +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>>> @@ -36,6 +36,10 @@
>>>  #include "brw_context.h"
>>>  #include "intel_tiled_memcpy.h"
>>>
>>> +#if defined(USE_SSE41)
>>> +#include "main/streaming-load-memcpy.h"
>>> +#include <smmintrin.h>
>>> +#endif
>>>  #if defined(__SSSE3__)
>>>  #include <tmmintrin.h>
>>>  #elif defined(__SSE2__)
>>> @@ -213,6 +217,31 @@ rgba8_copy_aligned_src(void *dst, const void *src, size_t bytes)
>>>     return dst;
>>>  }
>>>
>>> +#if defined(USE_SSE41)
>>> +static ALWAYS_INLINE void *
>>> +_memcpy_streaming_load(void *dest, const void *src, size_t count)
>>> +{
>>> +   if (count == 16) {
>>
>> After inlining and some optimization, the compiler will know for sure
>> what count is, right? Curious if it will always be able to optimize
>> away one of these branches.
>
> Exactly, we go to some lengths to have the count be constant in the fast
> paths, which will make it constant here when it is inlined. The
> generated assembly for the inner loop winds up being 4 movntdqa
> instructions in a row, just like we want.
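Scott's point about constant counts can be demonstrated with a toy stand-in (names are hypothetical, and plain memcpy stands in for the intrinsics): when the always_inline helper is called with a literal count, the comparisons constant-fold and only one path survives in the generated code.

```c
#include <stddef.h>
#include <string.h>

/* Toy analogue of _memcpy_streaming_load: a size dispatch that is cheap
 * only because `count` is a compile-time constant at every fast-path
 * call site. */
static inline __attribute__((always_inline)) void *
copy_16_or_64(void *dest, const void *src, size_t count)
{
   if (count == 16)
      return memcpy(dest, src, 16);
   else if (count == 64)
      return memcpy(dest, src, 64);
   else
      return memcpy(dest, src, count);  /* odd-length remnant */
}

/* After inlining, this compiles to just the 64-byte copy: both
 * comparisons fold away (inspect with gcc -O2 -S). */
void *
copy_line(void *dest, const void *src)
{
   return copy_16_or_64(dest, src, 64);
}
```

In the real patch the literal counts come from the tiling functions' fast paths, which is why the inner loop ends up as four back-to-back movntdqa instructions.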
>
>>> +      __m128i val = _mm_stream_load_si128((__m128i *)src);
>>> +      _mm_store_si128((__m128i *)dest, val);
>>> +      return dest;
>>> +   } else if (count == 64) {
>>> +      __m128i val0 = _mm_stream_load_si128(((__m128i *)src) + 0);
>>> +      __m128i val1 = _mm_stream_load_si128(((__m128i *)src) + 1);
>>> +      __m128i val2 = _mm_stream_load_si128(((__m128i *)src) + 2);
>>> +      __m128i val3 = _mm_stream_load_si128(((__m128i *)src) + 3);
>>> +      _mm_store_si128(((__m128i *)dest) + 0, val0);
>>> +      _mm_store_si128(((__m128i *)dest) + 1, val1);
>>> +      _mm_store_si128(((__m128i *)dest) + 2, val2);
>>> +      _mm_store_si128(((__m128i *)dest) + 3, val3);
>>> +      return dest;
>>> +   } else {
>>> +      assert(count < 64); /* and (count < 16) for ytiled */
>>> +      return memcpy(dest, src, count);
>>
>> I would just make this unreachable().
>
> This is another bit of a wart. This case actually is reached: the
> mem_copy_align16 functions in the tiling/detiling functions are just
> meant to guarantee that the source and destination are 16-byte aligned,
> but nothing about the length.
>
> So, the mem_copy_align16 function will get called for the remnant of the
> surface on the right side that is 16-byte aligned but not a multiple of
> 16 bytes.
>
> We could also handle this by doing a streaming load and then maskmovdqu
> to do a masked store. I suppose we know that the extra bytes read by the
> 16-byte-aligned load couldn't cross a page boundary. I just punted,
> though, and used the system memcpy.
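For the curious, the maskmovdqu variant Scott describes might look roughly like this. It is a sketch, not anything Mesa merged, and it leans on the thread's own observation that an aligned 16-byte over-read cannot cross a page boundary; the names and the fallback wrapper are mine.

```c
#include <stddef.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <smmintrin.h>  /* SSE4.1 streaming load + SSE2 masked store */

/* Copy count (< 16) bytes from a 16-byte-aligned source: one full
 * movntdqa load, then maskmovdqu stores only the live byte lanes.
 * The over-read to 16 bytes is safe per the discussion above, since
 * an aligned 16-byte load cannot cross a page boundary. */
__attribute__((target("sse4.1"))) static void
copy_remnant_masked_sse41(void *dest, void *src, size_t count)
{
   const __m128i index = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                       8, 9, 10, 11, 12, 13, 14, 15);
   __m128i val  = _mm_stream_load_si128((__m128i *)src);
   /* 0xFF (high bit set => byte is stored) wherever lane index < count */
   __m128i mask = _mm_cmpgt_epi8(_mm_set1_epi8((char)count), index);
   _mm_maskmoveu_si128(val, mask, (char *)dest);
   _mm_sfence();  /* maskmovdqu is a non-temporal store */
}
#endif

/* Portable wrapper: masked path when available, else plain memcpy. */
void
copy_remnant(void *dest, void *src, size_t count)
{
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
   if (count < 16 && __builtin_cpu_supports("sse4.1")) {
      copy_remnant_masked_sse41(dest, src, count);
      return;
   }
#endif
   memcpy(dest, src, count);
}
```

Whether this beats the system memcpy for sub-16-byte remnants is an open question; the patch's choice to punt is the simpler and safer one.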
Cool. That all sounds good to me. Have a

Reviewed-by: Matt Turner <[email protected]>

_______________________________________________
mesa-dev mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
