Hi Roland and Matt,

I am sorry, but this patch segfaults/asserts in the Khronos CTS test
ES2-CTS.gtf.GL2FixedTests.scissor.scissor.
I created this bug: https://bugs.freedesktop.org/show_bug.cgi?id=93962

I also added the bug to the OpenGL ES 3.1 tracker bug, since it blocks CTS
testing: https://bugs.freedesktop.org/show_bug.cgi?id=92788

/Marta

> -----Original Message-----
> From: mesa-dev [mailto:[email protected]] On
> Behalf Of Roland Scheidegger
> Sent: Friday, January 29, 2016 2:18 AM
> To: Matt Turner
> Cc: [email protected]
> Subject: Re: [Mesa-dev] [PATCH 2/4] i965: Provide sse2 version for rgba8 <->
> bgra8 swizzle
>
> On 29.01.2016 00:31, Matt Turner wrote:
> > On Sun, Jan 17, 2016 at 1:49 PM, <[email protected]> wrote:
> >> From: Roland Scheidegger <[email protected]>
> >>
> >> The existing code used ssse3, and because it isn't compiled in a
> >> separate file compiled with that, it is usually not used (that, of
> >> course, could be fixed...), whereas sse2 is always present, at least
> >> with 64-bit builds.
> >> It is actually trivial to do with sse2 without pshufb; on some cpus
> >> (I'm looking at you, atom!) it might not even be slower.
> >> This is compile-tested only. It doesn't actually do what I really
> >> want, which is glReadPixels without doing byte access from an
> >> uncached region - that is what you'll get on intel chips without llc
> >> when the cpu doesn't support sse41 (with sse41, the rb would be
> >> copied over with movntdqa instead of mapped, so mesa's format_utils
> >> byte-swapping conversion would access the cached copy instead of the
> >> uncached one). We'd really need sse2-optimized convert_ubyte
> >> functions for a proper fix; otherwise google maps in firefox is
> >> reduced to below 1 fps. But hey, why not. I don't have a gpu which
> >> could possibly hit this, albeit I successfully used the exact same
> >> code elsewhere.
> >
> > Bah, wall of text!
> >
> > I don't think we need all this. The commit title says everything I
> > think it needs to say.
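For context, the movntdqa copy described above works roughly like the sketch
below. This is illustrative only: it assumes SSE4.1, 16-byte-aligned pointers
and a size that is a multiple of 16, and it is not Mesa's actual
intel_miptree_map_movntdqa.

   #include <smmintrin.h> /* SSE4.1, for _mm_stream_load_si128 */
   #include <stddef.h>

   /* Copy from a write-combined (uncached) mapping into cached memory.
    * _mm_stream_load_si128 (movntdqa) reads through the 64-byte streaming
    * load buffer, so consecutive 16-byte loads from WC memory stay fast;
    * later per-byte accesses then hit the cached copy instead of the
    * uncached mapping. */
   static void
   streaming_load_memcpy(void *dst, void *src, size_t size)
   {
      __m128i *d = (__m128i *)dst;
      __m128i *s = (__m128i *)src;
      size_t i;

      for (i = 0; i < size / 16; i++) {
         __m128i tmp = _mm_stream_load_si128(&s[i]);
         _mm_store_si128(&d[i], tmp);
      }
   }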
> >
> >> v2: fix andnot argument order, add comments
> >> ---
> >>  src/mesa/drivers/dri/i965/intel_mipmap_tree.c  | 18 +++++++
> >>  src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 73 +++++++++++++++++++++-----
> >>  2 files changed, 79 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
> >> b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
> >> index 108dd87..5fc4212 100644
> >> --- a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
> >> +++ b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
> >> @@ -2773,6 +2773,24 @@ intel_miptree_map(struct brw_context *brw,
> >>     } else if (!(mode & GL_MAP_WRITE_BIT) &&
> >>                !mt->compressed && cpu_has_sse4_1 &&
> >>                (mt->pitch % 16 == 0)) {
> >> +      /*
> >> +       * XXX: without sse4_1, in some situations we still really want
> >> +       * to copy regardless. Presumably the copy is not done for
> >> +       * performance reasons: with movntdqa, thanks to the 64-byte
> >> +       * streaming load buffer, the uncached->cached copy followed by a
> >> +       * cached->cached copy later is always faster than doing an
> >> +       * "ordinary" uncached->cached copy. Without movntdqa an
> >> +       * additional copy of course doesn't help, albeit it has to be
> >> +       * said the uncached->cached copy is an order of magnitude slower
> >> +       * than the later cached->cached one in any case. But mesa may
> >> +       * not do a simple memcpy on that memory later - some glReadPixels
> >> +       * paths for instance will hit per-byte access, which is a
> >> +       * complete performance killer on uncached memory. So in these
> >> +       * cases we really want to copy regardless, unless the map below
> >> +       * could play some tricks to make the memory cached.
> >> +       * (Or otherwise ensure mesa can't hit these paths often; for
> >> +       * instance glReadPixels requiring conversions could be handled
> >> +       * by meta, so in the end it really would be just a memcpy.)
> >> +       */
> >
> > Walls of text are really hard to read...
> >
> > I don't think adding this comment to the code is particularly
> > valuable, but it does make me wonder if we should just add a memcpy
> > fallback after this path? Maybe we don't care once we have this patch.
>
> Ok, I'll remove that. It was rather difficult for me to figure out how
> mesa ends up doing the per-byte access out of uncached memory, which is
> why I wrote it.
> I don't know if always doing the memcpy is an acceptable solution (hence
> I didn't do anything about it). It would only add a quite small extra
> cost, but OTOH that perf hit would be on the fast path which you probably
> expect to hit often (i.e. mesa accessing things with a memcpy anyway).
> My guess is the fastest solution would be trying to remap as cacheable
> somehow, but I don't really know much about the driver (and I know tricks
> like that get ugly real fast with all the fences, clflushes etc.).
>
> >
> >>        intel_miptree_map_movntdqa(brw, mt, map, level, slice);
> >>  #endif
> >>     } else {
> >> diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> >> b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> >> index 2383401..42fdde1 100644
> >> --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> >> +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> >> @@ -36,10 +36,13 @@
> >>  #include "brw_context.h"
> >>  #include "intel_tiled_memcpy.h"
> >>
> >> -#ifdef __SSSE3__
> >> +#if defined(__SSSE3__)
> >>  #include <tmmintrin.h>
> >> +#elif defined(__SSE2__)
> >> +#include <emmintrin.h>
> >>  #endif
> >>
> >> +
> >>  #define FILE_DEBUG_FLAG DEBUG_TEXTURE
> >>
> >>  #define ALIGN_DOWN(a, b) ROUND_DOWN_TO(a, b)
> >> @@ -56,23 +59,69 @@
> >>  static const uint32_t ytile_width = 128;
> >>  static const uint32_t ytile_height = 32;
> >>  static const uint32_t ytile_span = 16;
> >>
> >> -#ifdef __SSSE3__
> >> +#if defined(__SSSE3__)
> >>  static const uint8_t rgba8_permutation[16] =
> >>     { 2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15 };
> >>
> >>  /* NOTE: dst must be 16-byte aligned. src may be unaligned. */
> >> -#define rgba8_copy_16_aligned_dst(dst, src) \
> >> -   _mm_store_si128((__m128i *)(dst), \
> >> -                   _mm_shuffle_epi8(_mm_loadu_si128((__m128i *)(src)), \
> >> -                                    *(__m128i *) rgba8_permutation))
> >> +static inline void
> >> +rgba8_copy_16_aligned_dst(void *dst, const void *src)
> >> +{
> >> +   __m128i reg;
> >> +   reg = _mm_loadu_si128((__m128i *)src);
> >> +   reg = _mm_shuffle_epi8(reg, *(__m128i *)rgba8_permutation);
> >> +   _mm_store_si128((__m128i *)dst, reg);
> >
> > Here, and...
> >
> >> +}
> >>
> >>  /* NOTE: src must be 16-byte aligned. dst may be unaligned. */
> >> -#define rgba8_copy_16_aligned_src(dst, src) \
> >> -   _mm_storeu_si128((__m128i *)(dst), \
> >> -                    _mm_shuffle_epi8(_mm_load_si128((__m128i *)(src)), \
> >> -                                     *(__m128i *) rgba8_permutation))
> >> +static inline void
> >> +rgba8_copy_16_aligned_src(void *dst, const void *src)
> >> +{
> >> +   __m128i reg;
> >> +   reg = _mm_load_si128((__m128i *)src);
> >> +   reg = _mm_shuffle_epi8(reg, *(__m128i *)rgba8_permutation);
> >> +   _mm_storeu_si128((__m128i *)dst, reg);
> >
> > ... here, I'd just make these single statements like they were before.
> > They fit easily within 80 columns.
>
> Ok, I can leave them alone. I just got rid of the macros because I hate
> them (at least certainly if there's no reason to use them in the first
> place), and I just find the nested intrinsics not very readable, but
> maybe that's just me.
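For reference, the single-statement form Matt is asking for would presumably
look like this (a sketch only: the old macro body folded into the new inline
function, relying on the file's existing rgba8_permutation table and the
SSSE3 include):

   /* NOTE: dst must be 16-byte aligned. src may be unaligned. */
   static inline void
   rgba8_copy_16_aligned_dst(void *dst, const void *src)
   {
      _mm_store_si128((__m128i *)dst,
                      _mm_shuffle_epi8(_mm_loadu_si128((__m128i *)src),
                                       *(__m128i *)rgba8_permutation));
   }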
> >
> >> +}
> >> +
> >> +#elif defined(__SSE2__)
> >> +static inline void
> >> +rgba8_copy_16_aligned_dst(void *dst, const void *src)
> >> +{
> >> +   __m128i srcreg, dstreg, agmask, ag, rb, r, b;
> >> +
> >> +   agmask = _mm_set1_epi32(0xFF00FF00);
> >> +   srcreg = _mm_loadu_si128((__m128i *)src);
> >
> > Okay, so we're starting with 0xAABBGGRR.
> >
> >> +
> >> +   rb = _mm_andnot_si128(agmask, srcreg);
> >
> > Produces 0x00BB00GG. Looks good.
>
> Surely you meant 0x00BB00RR :-).
>
> >
> >> +   ag = _mm_and_si128(agmask, srcreg);
> >
> > Produces 0xAA00GG00. Looks good.
> >
> >> +   r = _mm_srli_epi32(rb, 16);
> >
> > Produces 0x000000BB. Looks wrong -- should be b, not r.
> >
> >> +   b = _mm_slli_epi32(rb, 16);
> >
> > Produces 0x00RR0000. Looks wrong -- should be r, not b.
>
> Ahh, sharp eyes indeed. Not that it would change the result ;-).
>
> >
> >> +   dstreg = _mm_or_si128(ag, r);
> >> +   dstreg = _mm_or_si128(dstreg, b);
> >
> > I think we can do this better. Instead of two shifts and two ORs,
> > recognizing that the R and B values in rb have 8 bits of zero padding,
> > we can use _mm_shufflehi_epi16/_mm_shufflelo_epi16 and then a single
> > OR:
> >
> >    br = _mm_shufflehi_epi16(_mm_shufflelo_epi16(rb, _MM_SHUFFLE(2, 3, 0, 1)),
> >                             _MM_SHUFFLE(2, 3, 0, 1));
> >    dstreg = _mm_or_si128(ag, br);
> >
> > (appropriately line wrapped)
>
> Right, I missed the optimal solution. Not that it would make a real
> difference, but it's certainly better.
>
> >
> >> +
> >> +   _mm_store_si128((__m128i *)dst, dstreg);
> >> +}
> >> +
> >> +static inline void
> >> +rgba8_copy_16_aligned_src(void *dst, const void *src)
> >> +{
> >> +   __m128i srcreg, dstreg, agmask, ag, rb, r, b;
> >> +
> >> +   agmask = _mm_set1_epi32(0xFF00FF00);
> >> +   srcreg = _mm_load_si128((__m128i *)src);
> >> +
> >> +   rb = _mm_andnot_si128(agmask, srcreg);
> >> +   ag = _mm_and_si128(agmask, srcreg);
> >> +   r = _mm_srli_epi32(rb, 16);
> >> +   b = _mm_slli_epi32(rb, 16);
> >> +   dstreg = _mm_or_si128(ag, r);
> >> +   dstreg = _mm_or_si128(dstreg, b);
> >
> > Same comments here.
> >
> > With the walls of text removed, and the other couple of things
> > addressed, this patch would be
> >
> > Reviewed-by: Matt Turner <[email protected]>
>
> Thanks for the review!
>
> Roland
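Putting Matt's suggestion together, the SSE2 dst variant would presumably end
up looking something like the sketch below (untested, for illustration; the
src variant differs only in using the aligned load and the unaligned store).
It replaces the two shifts and two ORs with two shuffles and a single OR:

   #include <emmintrin.h> /* SSE2 */

   /* Swizzle each 0xAABBGGRR pixel to 0xAARRGGBB (rgba8 <-> bgra8),
    * 16 bytes (4 pixels) at a time.
    * NOTE: dst must be 16-byte aligned. src may be unaligned. */
   static inline void
   rgba8_copy_16_aligned_dst(void *dst, const void *src)
   {
      __m128i srcreg, dstreg, agmask, ag, rb, br;

      agmask = _mm_set1_epi32(0xFF00FF00);
      srcreg = _mm_loadu_si128((__m128i *)src);

      rb = _mm_andnot_si128(agmask, srcreg);   /* 0x00BB00RR */
      ag = _mm_and_si128(agmask, srcreg);      /* 0xAA00GG00 */
      /* Swap the two 16-bit halves of each 32-bit lane:
       * 0x00BB00RR -> 0x00RR00BB. */
      br = _mm_shufflehi_epi16(_mm_shufflelo_epi16(rb, _MM_SHUFFLE(2, 3, 0, 1)),
                               _MM_SHUFFLE(2, 3, 0, 1));
      dstreg = _mm_or_si128(ag, br);           /* 0xAARRGGBB */

      _mm_store_si128((__m128i *)dst, dstreg);
   }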
