Hi Martin,

On Sat, Dec 01, 2018 at 05:40:40PM +0100, Martin Willi wrote:
> 
> > An SSSE3 implementation of single-block HChaCha20 is also added so
> > that XChaCha20 can use it rather than the generic
> > implementation.  This required refactoring the ChaCha permutation
> > into its own function. 
> 
> > [...]
> 
> > +ENTRY(chacha20_block_xor_ssse3)
> > +   # %rdi: Input state matrix, s
> > +   # %rsi: up to 1 data block output, o
> > +   # %rdx: up to 1 data block input, i
> > +   # %rcx: input/output length in bytes
> > +
> > +   # x0..3 = s0..3
> > +   movdqa          0x00(%rdi),%xmm0
> > +   movdqa          0x10(%rdi),%xmm1
> > +   movdqa          0x20(%rdi),%xmm2
> > +   movdqa          0x30(%rdi),%xmm3
> > +   movdqa          %xmm0,%xmm8
> > +   movdqa          %xmm1,%xmm9
> > +   movdqa          %xmm2,%xmm10
> > +   movdqa          %xmm3,%xmm11
> > +
> > +   mov             %rcx,%rax
> > +   call            chacha20_permute
> > +
> >     # o0 = i0 ^ (x0 + s0)
> >     paddd           %xmm8,%xmm0
> >     cmp             $0x10,%rax
> > @@ -189,6 +198,23 @@ ENTRY(chacha20_block_xor_ssse3)
> >  
> >  ENDPROC(chacha20_block_xor_ssse3)
> >  
> > +ENTRY(hchacha20_block_ssse3)
> > +   # %rdi: Input state matrix, s
> > +   # %rsi: output (8 32-bit words)
> > +
> > +   movdqa          0x00(%rdi),%xmm0
> > +   movdqa          0x10(%rdi),%xmm1
> > +   movdqa          0x20(%rdi),%xmm2
> > +   movdqa          0x30(%rdi),%xmm3
> > +
> > +   call            chacha20_permute
> 
> AFAIK, the general convention is to create proper stack frames using
> FRAME_BEGIN/END for non-leaf functions. Should chacha20_permute()
> callers do so?
> 

Yes, I'll do that.  (Ard suggested the same for the arm64 version.)
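
For reference, a rough sketch of how hchacha20_block_ssse3 might look with
the frame macros added.  FRAME_BEGIN and FRAME_END come from <asm/frame.h>
and only emit the %rbp push/pop when CONFIG_FRAME_POINTER is enabled.  The
two stores after the call are not part of the hunk quoted above; they are
an assumption based on HChaCha20 returning state words 0..3 and 12..15.
This is untested and may differ from what ends up in the next version:

```asm
#include <asm/frame.h>

ENTRY(hchacha20_block_ssse3)
	# %rdi: Input state matrix, s
	# %rsi: output (8 32-bit words)
	FRAME_BEGIN			# push %rbp; mov %rsp,%rbp (frame-pointer builds only)

	movdqa		0x00(%rdi),%xmm0
	movdqa		0x10(%rdi),%xmm1
	movdqa		0x20(%rdi),%xmm2
	movdqa		0x30(%rdi),%xmm3

	call		chacha20_permute

	# Assumed, not shown in the quoted hunk: write rows 0 and 3
	movdqu		%xmm0,0x00(%rsi)
	movdqu		%xmm3,0x10(%rsi)

	FRAME_END			# pop %rbp (frame-pointer builds only)
	ret
ENDPROC(hchacha20_block_ssse3)
```

chacha20_block_xor_ssse3 would need the same treatment, with FRAME_END
placed before every return path rather than only the first one.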

- Eric
