On 28/08/2019 08:14, Jan Beulich wrote:
> On 27.08.2019 19:27, Andrew Cooper wrote:
>> On 27/08/2019 16:53, Jan Beulich wrote:
>>> On 27.08.2019 17:31, Andrew Cooper wrote:
>>>> On 01/07/2019 12:57, Jan Beulich wrote:
>>>>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>>>>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>>>>> @@ -9124,6 +9126,48 @@ x86_emulate(
>>>>> ASSERT(!state->simd_size);
>>>>> break;
>>>>> + case X86EMUL_OPC_66(0x0f38, 0x82): /* invpcid reg,m128 */
>>>>> + vcpu_must_have(invpcid);
>>>>> + generate_exception_if(ea.type != OP_MEM, EXC_UD);
>>>>> + generate_exception_if(!mode_ring0(), EXC_GP, 0);
>>>>> +
>>>>> + if ( (rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp, 16,
>>>>> + ctxt)) != X86EMUL_OKAY )
>>>>> + goto done;
>>>>
>>>> The actual behaviour in hardware is to not even read the memory
>>>> operand
>>>> if it is unused. You can demonstrate this by doing an ALL_INC_GLOBAL
>>>> flush with a non-canonical memory operand.
>>>
>>> Oh, that's sort of unexpected.
>>
>> It makes sense as an optimisation. There is no point fetching a memory
>> operand if you're not going to use it.
>>
>> Furthermore, it almost certainly reduces the microcode complexity.
>
> Probably. For comparison I had been thinking of 0-bit shifts instead,
> which do read their memory operands. Even SHLD/SHRD, which at least
> with shift count in %cl look to be uniformly microcoded, access their
> memory operand in this case.

Again, that isn't surprising to me.

You will never see a shift by 0 anywhere but a test suite, because it is
wasted effort.  Therefore, any attempt to special case 0 will reduce
performance in all production scenarios.

SHLD/SHRD's microcoded-ness comes from having to construct a double
width rotate.  In the worst case, this is two independent rotate uops
issued into the pipeline, and enough ALU logic to combine the results.
If you look, some CPUs have the 32bit versions non-microcoded, which
will probably be the frontend converting up to a 64bit uop internally.

INV{PCID,EPT,VPID} are all conceptually of the form:

    switch ( reg )
    {
        ... construct tlb uop.
    }

    dispatch tlb uop.

and avoiding one or two memory reads will make a meaningful performance
improvement.

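
For illustration only, here is a minimal standalone sketch of how an
emulator could mirror that shape for INVPCID, fetching the 128-bit
descriptor only for the types which consume it.  This is not the Xen
patch: emulate_invpcid(), struct tlb_op, read_fn, counting_read() and
the EMUL_* return codes are hypothetical names invented for this
example.  The type encodings (0-3) and the reserved descriptor bits
follow the SDM; the further architectural checks (CPL, canonical
address, PCID vs CR4.PCIDE) are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* INVPCID types, as encoded in the register operand. */
enum {
    INVPCID_ADDR           = 0, /* one address in one PCID */
    INVPCID_CTXT           = 1, /* everything in one PCID  */
    INVPCID_ALL_INC_GLOBAL = 2, /* all PCIDs, incl. globals */
    INVPCID_ALL_NON_GLOBAL = 3, /* all PCIDs, excl. globals */
};

/* Hypothetical return codes for this sketch. */
enum { EMUL_OKAY, EMUL_EXCEPTION };

/* Hypothetical stand-in for the "tlb uop" in the pseudocode above. */
struct tlb_op {
    unsigned int type;
    uint16_t pcid;
    uint64_t addr;
};

/* Hypothetical memory read hook: returns EMUL_OKAY or a failure code. */
typedef int (*read_fn)(uint64_t off, void *buf, size_t len, void *ctxt);

static int emulate_invpcid(uint64_t type, uint64_t off, read_fn read,
                           void *ctxt, struct tlb_op *op)
{
    uint64_t desc[2] = { 0, 0 };
    int rc;

    assert(read != NULL && op != NULL);

    if ( type > INVPCID_ALL_NON_GLOBAL )
        return EMUL_EXCEPTION;          /* invalid type: #GP */

    op->type = (unsigned int)type;
    op->pcid = 0;
    op->addr = 0;

    switch ( type )
    {
    case INVPCID_ADDR:
    case INVPCID_CTXT:
        /* Only these two types consume the descriptor. */
        rc = read(off, desc, sizeof(desc), ctxt);
        if ( rc != EMUL_OKAY )
            return rc;
        if ( desc[0] >> 12 )
            return EMUL_EXCEPTION;      /* reserved bits set: #GP */
        op->pcid = desc[0] & 0xfff;
        if ( type == INVPCID_ADDR )
            op->addr = desc[1];
        break;

    default:
        /* All-context flushes: the memory operand is never touched. */
        break;
    }

    return EMUL_OKAY;
}

/* Demo reader: counts calls and returns a descriptor with PCID 5. */
static int counting_read(uint64_t off, void *buf, size_t len, void *ctxt)
{
    uint64_t *desc = buf;

    assert(len == 2 * sizeof(uint64_t));
    ++*(int *)ctxt;
    desc[0] = 5;
    desc[1] = off;

    return EMUL_OKAY;
}
```

With counting_read() as the hook, an ALL_INC_GLOBAL flush with a
non-canonical offset succeeds without the hook ever being called,
whereas a single-context flush performs exactly one read.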
>
>>>> In particular, I was
>>>> intending to use this behaviour to speed up handling of INV{EPT,VPID}
>>>> which trap unconditionally.
>>>
>>> Which would require the observed behavior to also be the SDM
>>> mandated one, wouldn't it?
>>
>> If you recall, we discussed this with Jun in Budapest. His opinion was
>> no instructions go out of their way to check properties which don't
>> matter - it is just that it is far more obvious with instructions like
>> these where the complexity is much greater.
>>
>> No production systems are going to rely on getting faults, because
>> taking a fault doesn't produce useful work.
>
> Maybe I misunderstood your earlier reply then: I read it to mean you
> want to leverage INVPCID not faulting on "bad" memory operands for
> flush types not using the memory operand. But perhaps you merely
> meant you want to leverage the insn not _accessing_ its memory
> operand in this case?

Correct.  It's to avoid unnecessary page walks.

~Andrew

_______________________________________________
Xen-devel mailing list
[email protected]
https://lists.xenproject.org/mailman/listinfo/xen-devel