On Sun, 7 Apr 2019 17:13:31 +0200 Maarten de Boer <mdb.l...@resorama.com> wrote:
> Hi,
>
> > […]
> > col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /= 9.0;
> >    0x5a39  pxor %xmm0,%xmm0
> > […]
> >
> > Notice how the line containing the chain of divisions is compiled to
> > a single SSE operation.
>
> I don’t see any SSE operation here. The pxor is just to zero the xmm0
> register.
>
> It’s a bit difficult to know what you are doing here, not having
> context and not knowing the datatypes, but it does indeed look like
> this code could benefit from vectorisation, since you are doing
> calculation in blocks of 4. E.g. you can multiply 4 floating points
> in a single SSE instruction, add 4 floating points in a single SSE
> instruction, etc.
>
> e.g.
>
> __m128 factor = _mm_set_ps1 (1.0f/9.0f);
> __m128 result = _mm_mul_ps (packed, factor);
>
> would divide the 4 floats in packed by 9. (We could use _mm_div_ps,
> but multiplication is faster than division.)
>
> (See https://software.intel.com/sites/landingpage/IntrinsicsGuide/ )
>
> Depending on your data, it might be faster to stay in the floating
> point domain as long as possible to use SSE floating point
> operations, and convert to integer at the last moment.
>
> If you do want/need to stay in the integer domain, note that there is
> no SIMD instruction for integer division, but you could use a
> multiplication here as well:
>
> __m128i _mm_mulhi_epi16 (__m128i a, __m128i b)
>
> multiplies the packed 16-bit integers in a and b (so 8 at the same
> time), producing intermediate 32-bit integers, and stores the high 16
> bits of the intermediate integers in the result.
>
> Taking the high 16 bits of the 32-bit intermediate result is
> effectively dividing by 65536. Since x/9 can be expressed (with some
> error) as x*7282/65536:
>
> __m128i factor = _mm_set1_epi16 (7282);
> __m128i result = _mm_mulhi_epi16 (packed, factor);
>
> Of course you would have to get your 8-bit integers (I assume)
> into/out of the packed 16-bit registers.
> That said, whether you want to do this kind of vectorisation by hand
> is a different matter. The compiler is pretty good at doing these
> kinds of optimisations. Make sure you pass the right flags to turn on
> SSE and AVX at the levels you want to support. But it certainly is
> possible to improve on what the compiler does. I have obtained
> significant speed boosts through rewriting inner loops with SSE
> intrinsics. But even if you choose to stay in C, having some
> knowledge of the SSE instruction set certainly might help.
>
> Maarten

Thanks, Maarten. It looks like you are proposing Intel-specific intrinsics. I had already looked through the gcc docs hoping to find something similar, but only found the vectorization section among the C extensions; I was hoping for something gcc-specific rather than Intel-specific. In earlier experiments I did get pure SSE code when multiplying or adding elements of a fixed-size automatic array. I really did not recognize that nasty trick of clearing xmm0 :). I also now understand why SSE can't be used here: without integer division support it is undoable in SSE, and replacing the division with a multiplication means converting to float. Still, just as an experiment, I replaced this line and the 4 add lines with:

op[0] = (col[1] + col[2]) / 2;

to see whether it would involve PAVGW or similar - it did not. I couldn't make sense of that single pxor line without any other SSE instructions around it, but after I changed the gcc options to -O2, more meaningful lines appeared where expected. Probably -O3 shuffled the code too much to be represented correctly in the debugger, even with -g :/ . I'm now set on trying the floating-point way.

More about my app: I'm a university student in software engineering.
It so happened that just at the beginning of this course I finally decided to start what I had been after for a long time (these spectral editing helpers), and during the session a new subject appeared in which we had to write an application relying on input/output (such as any GUI program), so I dedicated my plan to this. I'm still not sure whether it is OK to publish it before it is defended; otherwise it is ready for that.

As for the post-processing: I'm experimenting with subpixel rendering (another weakness of mine :) ). I take 3x3 pixel blocks (so-called super-pixels) and pass them through the complete chain. 3x1 would probably also work, if cairo has no problem rendering to surfaces with such a virtual pixel ratio, but for now it is 3x3. In my pipeline the image is split into a grey minimum and a color remainder; the grey part is mixed down at subpixel level, while the color part is simply pixelated, and both are summed into the destination. The code chunk I showed in my previous post is the averaging step for this remainder. With the current implementation, and all cairo rendering to the 3x-resolution surface commented out, it is about 30% faster than simple downsampling with cairo itself (i.e. using the 3x surface directly as the drawing source) - and that is with the source and destination surfaces still being created and destroyed on every draw() callback run (an issue I'm just about to fix).

_______________________________________________
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
https://lists.linuxaudio.org/listinfo/linux-audio-dev