On Tue, Feb 23, 2021 at 4:54 PM Dey, Megha <megha....@intel.com> wrote:
>
> Hi Andy,
>
> On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
> > On Fri, Jan 22, 2021 at 11:29 PM Megha Dey <megha....@intel.com> wrote:
> >> Optimize crypto algorithms using AVX512 instructions - VAES and
> >> VPCLMULQDQ (first implemented on Intel's Icelake client and Xeon CPUs).
> >>
> >> These algorithms take advantage of the AVX512 registers to keep the
> >> CPU busy and increase memory bandwidth utilization. They provide
> >> substantial (2-10x) improvements over existing crypto algorithms when
> >> the update data size is greater than 128 bytes and have no significant
> >> impact when used on small amounts of data.
> >>
> >> However, these algorithms may also incur a frequency penalty and cause
> >> collateral damage to other workloads running on the same core
> >> (co-scheduled threads). These frequency drops are also known as bin
> >> drops, where 1 bin drop is around 100 MHz. With the SpecCPU and ffmpeg
> >> benchmarks, a 0-1 bin drop (0-100 MHz) is observed on the Icelake
> >> desktop and 0-2 bin drops (0-200 MHz) are observed on the Icelake
> >> server.
> >>
> >> The AVX512 optimizations are disabled by default to avoid impacting
> >> other workloads. In order to use these optimized algorithms:
> >> 1. At compile time:
> >>    a. The user must enable the CONFIG_CRYPTO_AVX512 option.
> >>    b. The toolchain (assembler) must support the VPCLMULQDQ and VAES
> >>       instructions.
> >> 2. At run time:
> >>    a. The user must set the module parameter use_avx512 at boot time.
> >>    b. The platform must support the VPCLMULQDQ and VAES features.
> >>
> >> N.B. It is unclear whether these coarse-grained controls (a global
> >> module parameter) would meet all user needs. Perhaps some per-thread
> >> control might be useful? Looking for guidance here.
> >
> > I've just been looking at some performance issues with in-kernel AVX,
> > and I have a whole pile of questions that I think should be answered
> > first:
> >
> > What is the impact of using an AVX-512 instruction on the logical
> > thread, its siblings, and other cores on the package?
> >
> > Does the impact depend on whether it's a 512-bit insn or a shorter
> > EVEX insn?
> >
> > What is the impact on subsequent shorter EVEX, VEX, and legacy
> > SSE(2, 3, etc.) insns?
> >
> > How does VZEROUPPER figure in? I can find an enormous amount of
> > misinformation online, but nothing authoritative.
> >
> > What is the effect of the AVX-512 states (5-7) being "in use"? As far
> > as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
> > and its variants. Is this correct?
> >
> > On AVX-512 capable CPUs, do we ever get a penalty for executing a
> > non-VEX insn followed by a large-width EVEX insn without an
> > intervening VZEROUPPER? The docs suggest no, since Broadwell and
> > before don't support EVEX, but I'd like to know for sure.
> >
> >
> > My current opinion is that we should not enable AVX-512 in-kernel
> > except on CPUs that we determine have good AVX-512 support. Based on
> > some reading, that seems to mean Ice Lake Client and not anything
> > before it. I also think a bunch of the above questions should be
> > answered before we do any of this. Right now we have a regression of
> > unknown impact in regular AVX support in-kernel, we will have
> > performance issues in-kernel depending on what user code has done
> > recently, and I'm still trying to figure out what to do about it.
> > Throwing AVX-512 into the mix without real information is not going
> > to improve the situation.
>
> We are currently working on providing you with answers to the questions
> you have raised regarding AVX.
Thanks!