On Thu, Jun 05, 2025 at 11:29:11AM +0200, Burakov, Anatoly wrote: > On 6/4/2025 4:59 PM, Bruce Richardson wrote: > > On Fri, May 30, 2025 at 02:57:19PM +0100, Anatoly Burakov wrote: > > > Currently, for 32-byte descriptor format, only SSE instruction set is > > > supported. Add implementation for AVX2 and AVX512 instruction sets. Since > > > we are using Rx descriptor definitions from common code, we can just use > > > the generic descriptor definition, as we only ever write the first 16 > > > bytes > > > of it, and the layout is always the same for that part. > > > > > > Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com> > > > --- > > > > > > > Like the idea. Feedback inline below. > > > > /Bruce > > > <snip>
> > > > - /** > > > - * merge 0 & 1, by casting 0 to 256-bit and inserting 1 > > > - * into the high lanes. Similarly for 2 & 3, and so on. > > > - */ > > > - const __m256i addr0_256 = _mm256_castsi128_si256(vaddr0); > > > - const __m256i addr2_256 = _mm256_castsi128_si256(vaddr2); > > > - const __m256i addr4_256 = _mm256_castsi128_si256(vaddr4); > > > - const __m256i addr6_256 = _mm256_castsi128_si256(vaddr6); > > > + const __m128i vaddr0 = _mm_loadu_si128((const __m128i > > > *)&mb0->buf_addr); > > > + const __m128i vaddr1 = _mm_loadu_si128((const __m128i > > > *)&mb1->buf_addr); > > > + const __m128i vaddr2 = _mm_loadu_si128((const __m128i > > > *)&mb2->buf_addr); > > > + const __m128i vaddr3 = _mm_loadu_si128((const __m128i > > > *)&mb3->buf_addr); > > > - const __m256i addr0_1 = _mm256_inserti128_si256(addr0_256, > > > vaddr1, 1); > > > - const __m256i addr2_3 = _mm256_inserti128_si256(addr2_256, > > > vaddr3, 1); > > > - const __m256i addr4_5 = _mm256_inserti128_si256(addr4_256, > > > vaddr5, 1); > > > - const __m256i addr6_7 = _mm256_inserti128_si256(addr6_256, > > > vaddr7, 1); > > > + reg0 = _ci_rxq_rearm_desc_avx512(vaddr0, zero, vaddr1, > > > zero); > > > + reg1 = _ci_rxq_rearm_desc_avx512(vaddr2, zero, vaddr3, > > > zero); > > > > I can't help but thinking we can probably do a little better than this > > merging in zeros using AVX-512 mask registers, e.g. using > > _mm256_maskz_broadcastq_epi64() intrinsic, but it will be ok for now! :-) > > > > You're welcome to submit patches, this is a very welcoming community! > > (seriously though, I'll look into it) > Great, but I also think this is fine for now if you want to keep it as-is. We can do a post-rework optimization patchset after this goes in. Main thing is just to ensure we don't see a perf regression after this work. /Bruce