On 21.03.2023 17:31, Roger Pau Monné wrote:
> On Tue, Mar 21, 2023 at 04:35:54PM +0100, Jan Beulich wrote:
>> On 21.03.2023 15:57, Roger Pau Monné wrote:
>>> On Mon, Mar 20, 2023 at 10:48:45AM +0100, Jan Beulich wrote:
>>>> On 17.03.2023 13:26, Andrew Cooper wrote:
>>>>> On 17/03/2023 11:22 am, Roger Pau Monné wrote:
>>>>>> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote:
>>>>>>> This is faster than using the software implementation, and the insn is
>>>>>>> available on all half-way recent hardware. Therefore convert
>>>>>>> generic_hweight<N>() to out-of-line functions (without affecting Arm)
>>>>>>> and use alternatives patching to replace the function calls.
>>>>>>>
>>>>>>> Note that the approach doesn#t work for clang, due to it not recognizing
>>>>>>> -ffixed-*.
>>>>>> I've been giving this a look, and I wonder if it would be fine to
>>>>>> simply push and pop the scratch registers in the 'call' path of the
>>>>>> alternative, as that won't require any specific compiler option.
>>>>
>>>> Hmm, ...
>>>>
>>>>> It's been a long while, and in that time I've learnt a lot more about
>>>>> performance, but my root objection to the approach taken here still
>>>>> stands - it is penalising the common case to optimise some pointless
>>>>> corner cases.
>>>>>
>>>>> Yes - on the call path, an extra push/pop pair (or few) to get temp
>>>>> registers is basically free.
>>>>
>>>> ... what is "a few"? We'd need to push/pop all call-clobbered registers
>>>> except %rax, i.e. a total of eight. I consider this too much. Unless,
>>>> as you suggest further down, we wrote the fallback in assembly. Which I
>>>> have to admit I'm surprised you propose when we strive to reduce the
>>>> amount of assembly we have to maintain.
>>>
>>> AMD added popcnt in 2007 and Intel in 2008.  While we shouldn't
>>> mandate popcnt support, I think we also shouldn't overly worry about
>>> the non-popcnt path.
>>
>> We agree here.
>>
>>> Also, how can you assert that the code generated without the scratch
>>> registers being usable won't be worse than the penalty of pushing and
>>> popping such registers on the stack and letting the routines use all
>>> registers freely?
>>
>> Irrespective of the -ffixed-* the compiler can make itself sufficiently
>> many registers available to use, by preserving just the ones it actually
>> uses. So that's pretty certainly not more PUSH/POP than when we would
>> blindly preserve all which may need preserving (no matter whether
>> they're actually used).
>>
>> Yet then both this question and ...
>>
>>> I very much prefer to have a non-optimal non-popcnt path, but have
>>> popcnt support for both gcc and clang, and without requiring any
>>> compiler options.
>>
>> ... this makes me wonder whether we're thinking of fundamentally
>> different generated code that we would end up with. Since the
>> (apparently agreed upon) goal is to optimize for the "popcnt
>> available" case, mainline code should be not significantly longer that
>> what's needed for the single instruction itself, or alternatively a
>> CALL insn. Adding pushes/pops of all call clobbered registers around
>> the CALL would grow mainline code too much (for my taste), i.e.
>> irrespective of these PUSH/POP all getting NOP-ed out by alternatives
>> patching. (As an aside I consider fiddling with the stack pointer in
>> inline asm() risky, and hence I would prefer to avoid such whenever
>> possible.
> 
> Is this because we are likely to not end up with a proper trace if we
> mess up with the stack pointer before a function call?

That's a related issue, but not what I was thinking about.

>  I would like
> to better understand your concerns with the stack pointer and inline
> asm().

You can't use local variables anymore with "m" or "rm" constraints, as
they might be accessed via %rsp.

> Other options would be using no_caller_saved_registers, but that's not
> available in all supported gcc versions, likely the same with clang,
> but I wouldn't mind upping the minimal clang version.

Right, you did suggest using this attribute before. But we continue to
support older gcc.

> What about allocating the size of the registers that need saving as an
> on-stack variable visible to the compiler and then moving to/from
> there in the inline asm()?

That would address the fiddling-with-%rsp concern, but would further
grow mainline code size.

>> Yes, it can be written so it's independent of what the
>> compiler thinks the stack pointer points at, but such constructs are
>> fragile as soon as one wants to change them a little later on.)
>>
>> Are you perhaps thinking of indeed having the PUSH/POP in mainline
>> code? Or some entirely different model?
>>
>> What I could see us doing to accommodate Clang is to have wrappers
>> around the actual functions which do as you suggest and preserve all
>> relevant registers, no matter whether they're used. That'll still be
>> assembly code to maintain, but imo less of a concern than in Andrew's
>> model of writing hweight<N>() themselves in assembly (provided I
>> understood him correctly), for being purely mechanical operations.
> 
> If possible I would prefer to use the same model for both gcc and
> clang, or else things tend to get more complicated than strictly
> needed.

We can always decide to accept the extra overhead even with gcc.

> I also wonder whether we could also get away without disabling of
> coverage and ubsan for the hweight object file.  But I assume as long
> ass we do the function call in inline asm() (and thus kind of hidden
> from the compiler) we need to disable any instrumentation because the
> changed function context.  I don't think there's anyway we can
> manually add the function call prefix/suffix in the inline asm()?

I don't know whether that would be possible (and portable across
versions). But what I'm more concerned about is that such functions
may also end up clobbering certain call-clobbered registers. (Note
that when writing these in assembly, as suggested by Andrew, no such
hooks would be used either.)

Jan

Reply via email to