On 17/03/2023 11:22 am, Roger Pau Monné wrote:
> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote:
>> This is faster than using the software implementation, and the insn is
>> available on all half-way recent hardware. Therefore convert
>> generic_hweight<N>() to out-of-line functions (without affecting Arm)
>> and use alternatives patching to replace the function calls.
>>
>> Note that the approach doesn#t work for clang, due to it not recognizing
>> -ffixed-*.
> I've been giving this a look, and I wonder if it would be fine to
> simply push and pop the scratch registers in the 'call' path of the
> alternative, as that won't require any specific compiler option.

It's been a long while, and in that time I've learnt a lot more about
performance, but my root objection to the approach taken here still
stands - it is penalising the common case to optimise some pointless
corner cases.

Yes - on the call path, an extra push/pop pair (or few) to get temp
registers is basically free.


Right now, the generic_hweight() helpers are static inline in
xen/bitops.h and this is nonsense.  The absolute best they should be is
extern inline in our new lib/ and I'll bet that the compilers stop
inlining them there and then.

Given new abi's like x86_64-v2 (which guarantees POPCNT as an available
feature), it would be nice to arrange to use __builtin_popcnt() to let
the compiler optimise to its hearts content, but outside of this case,
it is actively damaging to try and optimise for memory operands or to
inline the 8/16 case.

So, for x86 specifically, we want:

if ( CONFIG_POPCNT )
    __builtin_popcnt()
else
    ALT( "popcnt D -> a",
         "call arch_popcnt$N" )

and we can write arch_popcnt* in x86's lib/ and in assembly, because
these are trivial enough and we do need to be careful with registers/etc.

I'm not sure if a "+D" vs "D" will matter at the top level.  Probably
not, so it might be an easy way to get one tempt register.  Other temp
registers can come from push/pop.


While we're at it, we should split hweight out of bitops and write the
common header in such a way that it defaults to the generic
implementations in lib/, and that will subsume the ARM header and also
make this work on RISC-V for free.

~Andrew

Reply via email to