On Fri, Sep 05, 2025 at 09:19:29AM -0700, Kees Cook wrote:
> On Fri, Sep 05, 2025 at 10:51:03AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 04, 2025 at 05:24:10PM -0700, Kees Cook wrote:
> > > +- The check-call instruction sequence must be treated as a single unit: it
> > > +  cannot be rearranged or split or optimized. The pattern is that
> > > +  indirect calls, "call *$target", get converted into:
> > > +
> > > +    mov $target_expression, %target ; only present if the expression was
> > > +                                    ; not already %target register
> > > +    load -$offset(%target), %tmp    ; load the typeid hash at target
> > > +    cmp $hash, %tmp                 ; compare expected typeid with loaded
> > > +    je .Lcheck_passed               ; jump to the indirect call
> > > +  .Lkcfi_trap$N:                    ; label of trap insn
> > > +    trap                            ; trap on failure, but arranged so
> > > +                                    ; "permissive mode" falls through
> > > +  .Lkcfi_call$N:                    ; label of call insn
> > > +    call *%target                   ; actual indirect call
> > > +
> > > +  This pattern of call immediately after trap provides for the
> > > +  "permissive" checking mode automatically: the trap gets handled,
> > > +  a warning emitted, and then execution continues after the trap to
> > > +  the call.
> > 
> > I know it is far too late to do anything here. But I've recently dug
> > through a bunch of optimization manuals and the like and that Jcc is
> > about as bad as it gets :/
> > 
> > The old optimization manual states that forward jumps are assumed
> > not-taken; while backward jumps are assumed taken.
> > 
> > The new wisdom is that any Jcc must be assumed not-taken; that is, the
> > fallthrough case has the best throughput.
> 
> I would expect the cmp to be the slowest part of this sequence, and I
> figured both the trap and the call to be speculation barriers? I'm
> not sure, though. Is changing the sequence actually useful?

The load can miss, in which case it is definitely the most expensive
thing around.

> > Here we have a forward branch which is assumed taken :-(
> 
> The constraints we have are:
> 
> - Linux x86 KCFI trap handler decodes the instructions from the trap
>   backwards, but it uses exact offsets (-12 and -6).
> - Control flow following the trap must make the call (for warn-only mode)
> 
> If we change this, we'd need to make the insn decoder smarter to likely
> look at the insn AFTER the trap ("is it a direct jump?")
> 
> And then use this, which is ugly, but matches second constraint:
> 
>       cmp $hash, %tmp
>       jne .Ltrap
> .Lcall:
>       call *%target
>       jmp .Ldone
> .Ltrap:
>       ud2
>       jmp .Lcall
> .Ldone:

Ah, you can do something like:

        cmp $hash, %tmp
        jne +3
        nopl -42(%rax)
        call *%target

which is only 2 bytes longer. Notably, that nopl is 4 bytes and the 4th
byte is 0xd6 (aka UDB). This is an effective UDcc instruction based
around a forward non-taken branch.
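
The byte arithmetic above can be checked with a small Python sketch. It
does not run an assembler: the opcode bytes are hard-coded on the
assumption of the usual x86 encodings (short Jcc is 0x74/0x75 plus rel8,
ud2 is 0f 0b, and nopl disp8(%rax) is 0f 1f 40 plus disp8), and the
helper names are made up for illustration.

```python
import struct

# Assumed x86 encodings (hard-coded, not produced by an assembler).
JE_SHORT = 0x74                      # je rel8
JNE_SHORT = 0x75                     # jne rel8
UD2 = bytes([0x0F, 0x0B])            # ud2
NOPL_RAX_DISP8 = bytes([0x0F, 0x1F, 0x40])  # nopl disp8(%rax), minus the disp8


def encode_jcc_short(opcode: int, rel8: int) -> bytes:
    """Hypothetical helper: short conditional jump with a signed 8-bit offset."""
    return bytes([opcode]) + struct.pack("b", rel8)


def encode_nopl_rax_disp8(disp: int) -> bytes:
    """Hypothetical helper: nopl disp(%rax) with a signed 8-bit displacement."""
    return NOPL_RAX_DISP8 + struct.pack("b", disp)


# Current pattern between the cmp and the call: je over the trap, then ud2.
old_seq = encode_jcc_short(JE_SHORT, len(UD2)) + UD2

# Proposed pattern: jne +3 into the last byte of the nopl.
jne = encode_jcc_short(JNE_SHORT, 3)
nopl = encode_nopl_rax_disp8(-42)
new_seq = jne + nopl

# rel8 is relative to the end of the jne, so the taken target is here:
taken_target = len(jne) + 3

# The fallthrough path executes the whole nopl and ends at the call.
fallthrough_target = len(new_seq)

print(new_seq.hex())                 # 75030f1f40d6
print(hex(new_seq[taken_target]))    # 0xd6
print(len(new_seq) - len(old_seq))   # 2
```

So the not-taken (passing) path executes a plain 4-byte NOP and reaches
the call, while the taken (failing) path lands on the 0xd6 displacement
byte.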

But yeah, I don't know if it is worth changing this. It's just that I've
been staring at these things far too much of late :-)
