On Thu, Jun 05, 2025 at 11:29:11AM +0200, Burakov, Anatoly wrote:
> On 6/4/2025 4:59 PM, Bruce Richardson wrote:
> > On Fri, May 30, 2025 at 02:57:19PM +0100, Anatoly Burakov wrote:
> > > Currently, for 32-byte descriptor format, only SSE instruction set is
> > > supported. Add implementation for AVX2 and AVX512 instruction sets. Since
> > > we are using Rx descriptor definitions from common code, we can just use
> > > the generic descriptor definition, as we only ever write the first 16 
> > > bytes
> > > of it, and the layout is always the same for that part.
> > > 
> > > Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
> > > ---
> > > 
> > 
> > Like the idea. Feedback inline below.
> > 
> > /Bruce
> > 
> 
<snip>

> 
> > > -         /**
> > > -          * merge 0 & 1, by casting 0 to 256-bit and inserting 1
> > > -          * into the high lanes. Similarly for 2 & 3, and so on.
> > > -          */
> > > -         const __m256i addr0_256 = _mm256_castsi128_si256(vaddr0);
> > > -         const __m256i addr2_256 = _mm256_castsi128_si256(vaddr2);
> > > -         const __m256i addr4_256 = _mm256_castsi128_si256(vaddr4);
> > > -         const __m256i addr6_256 = _mm256_castsi128_si256(vaddr6);
> > > +                 const __m128i vaddr0 = _mm_loadu_si128((const __m128i 
> > > *)&mb0->buf_addr);
> > > +                 const __m128i vaddr1 = _mm_loadu_si128((const __m128i 
> > > *)&mb1->buf_addr);
> > > +                 const __m128i vaddr2 = _mm_loadu_si128((const __m128i 
> > > *)&mb2->buf_addr);
> > > +                 const __m128i vaddr3 = _mm_loadu_si128((const __m128i 
> > > *)&mb3->buf_addr);
> > > -         const __m256i addr0_1 = _mm256_inserti128_si256(addr0_256, 
> > > vaddr1, 1);
> > > -         const __m256i addr2_3 = _mm256_inserti128_si256(addr2_256, 
> > > vaddr3, 1);
> > > -         const __m256i addr4_5 = _mm256_inserti128_si256(addr4_256, 
> > > vaddr5, 1);
> > > -         const __m256i addr6_7 = _mm256_inserti128_si256(addr6_256, 
> > > vaddr7, 1);
> > > +                 reg0 = _ci_rxq_rearm_desc_avx512(vaddr0, zero, vaddr1, 
> > > zero);
> > > +                 reg1 = _ci_rxq_rearm_desc_avx512(vaddr2, zero, vaddr3, 
> > > zero);
> > 
> > I can't help but thinking we can probably do a little better than this
> > merging in zeros using AVX-512 mask registers, e.g. using
> >   _mm256_maskz_broadcastq_epi64()  intrinsic, but it will be ok for now! :-)
> > 
> 
> You're welcome to submit patches, this is a very welcoming community!
> 
> (seriously though, I'll look into it)
> 
Great, but I also think this is fine for now if you want to keep it as-is.
We can do a post-rework optimization patchset after this goes in. Main
thing is just to ensure we don't see a perf regression after this work.

/Bruce

Reply via email to