On 10/01/2012 09:15, David Terei wrote:
Hi Simon,

This is finally fixed!

LLVM now creates this code for Updates.cmm:

stg_upd_frame_info:                     # @stg_upd_frame_info
# BB#0:                                 # %cL
    subq  $24, %rsp
    movq  8(%rbp), %rdx
    movq  %rbx, 8(%rdx)
    movq  $stg_BLACKHOLE_info, (%rdx)
    movq  %rdx, %rax
    shrq  $6, %rax
    andq  $16320, %rax            # imm = 0x3FC0
    movq  %rdx, %rcx
    andq  $-1048576, %rcx         # imm = 0xFFFFFFFFFFF00000
    orq   %rax, %rcx
    movswq   40(%rcx), %rax
    addq  $16, %rbp   testq %rax, %rax
    movq  %r15, %rcx
    je .LBB0_4

compared to the NCG which creates:

stg_upd_frame_info:
.LcL:
    movq 8(%rbp),%rax
    movq %rax,64(%rsp)
    addq $16,%rbp
    movq %rbx,8(%rax)
    movq $stg_BLACKHOLE_info,0(%rax)
    movq %rax,%rcx
    andq $-1048576,%rcx
    andq $1044480,%rax
    shrq $6,%rax
    orq %rcx,%rax
    cmpw $0,40(%rax)
    jne .LcH
    jmp *0(%rbp).LcH:

See ticket:>  http://hackage.haskell.org/trac/ghc/ticket/4308 for
respective commits.

It's great that you're now tracking the liveness of GlobalRegs, but the LLVM code here still doesn't look as good. Any idea why? (this bit of code is executed a *lot*, saving a few instructions here is worthwhile)

Cheers,
        Simon



Cheers,
David

On 13 September 2010 18:49, David Terei<davidte...@gmail.com>  wrote:
OK, I've created a ticket for this issue here
http://hackage.haskell.org/trac/ghc/ticket/4308

On 13 September 2010 18:29, Simon Marlow<marlo...@gmail.com>  wrote:
On 13/09/2010 09:03, David Terei wrote:

Hi Simon,

Hmm, I've been trying to find a nasty performance regression when ghc
is bootstrapped with llvm, this problem must be a large part of this
so thanks.

Right, I expect the mfence on every update is costing quite a lot (one of
those can be hundreds of cycles).

Thanks for looking at this!

Cheers,
`       Simon


  - All the spilling.  In the slow path there's a foreign call (to
allocBlocks_lock)
    but that is annotated with the live registers, in this case [R1], so I
don't
    understand why LLVM should be spilling everything.  Any ideas?

I fixed a problem like this ox x86 a month or two  back. The issue is
this:

fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
   // do some stuff
   call foreign g() [R1];
   // do some stuff
   tail call f_next (Base, Hp', Sp', R1', R2, R3, R4, R5, R6);
}

Because of the use of the calling convention to pass the stg registers
around llvm thinks they are all live across the call in the code
above, since well they are. The fix is to explicitly kill the
registers that aren't live across the call, which is easy as llvm
provides a nice symbolic 'undef' value that can do just this. So above
code needs to become:

fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
   // do some stuff
   call foreign g() [R1];
   // do some stuff
   tail call f_next (Base, Hp', Sp', R1', undef, undef, undef, undef,
undef);
}

What I actually do is slightly different then this but the concept is
the same. Hope that makes sense. As I said I thought I had made this
fix a while back but hand written Cmm generally tests very different
code paths than compiler generated Cmm, so I must have missed a case.
This should be an easy fix as well.

I played around a little with compiling Updates.cmm on x86-32, will
try soon on x64. The ncg gives this assembly (annotated with the
corresponding cmm for my benefit):

_stg_upd_frame_info:
.Lcz:
   movl 4(%ebp),%eax                 # bits32 updatee = b32[Sp +
SIZEOF_StgHeader]
   movl %eax,64(%esp)                # spill updatee
   addl $8,%ebp                      # Sp = Sp + (SIZEOF_StgHeader + 4)
   movl %esi,4(%eax)                 # gcptr[updatee + SIZEOF_StgHeader] =
R1
                                     # prim %write_barrier() [];
   movl $_stg_BLACKHOLE_info,0(%eax) # b32[updatee] = stg_BLACKHOLE_info

   movl %eax,%ecx                    # copy updatee
   andl $-1048576,%ecx               # x = updatee&     ((1<<    20) - 1)&
~((1<<    12) - 1)
   andl $1044480,%eax                # y = updatee&    ~((1<<    20) - 1)
   shrl $7,%eax                      # x>>    (12 - 5)
   orl %ecx,%eax                     # bd = x&    y

   cmpw $0,28(%eax)                  #
   jne .LcA                          # if ( b16[bd + 28] != 0): goto .LcA

   jmp *0(%ebp)                      # else: jump %ENTRY_CODE( bits32[Sp])


If I compile it with llvm (manually fixing the write barrier issue) I
get the following llvm code:

define cc10 void @stg_upd_frame_info(i32* noalias nocapture %Base_Arg,
i32* noalias nocapture %Sp_Arg, i32* noalias nocapture %Hp_Arg, i32
%R1_Arg) nounwind section ".text; .text 2#" align 4 {
c2K:
   %ln2M = getelementptr inbounds i32* %Sp_Arg, i32 1
   %ln2Q = load i32* %ln2M
   %ln2V = getelementptr inbounds i32* %Sp_Arg, i32 2
   %ln2Y = add i32 %ln2Q, 4
   %ln30 = inttoptr i32 %ln2Y to i32*
   store i32 %R1_Arg, i32* %ln30
   %ln34 = inttoptr i32 %ln2Q to i32*
   store i32 ptrtoint ([0 x i32]* @stg_BLACKHOLE_info to i32), i32* %ln34
   %ln3c = lshr i32 %ln2Q, 7
   %ln3e = and i32 %ln3c, 8160
   %ln3j = and i32 %ln2Q, -1048576
   %ln3k = or i32 %ln3e, %ln3j
   %ln3m1 = or i32 %ln3k, 28
   %ln3n = inttoptr i32 %ln3m1 to i16*
   %ln3o = load i16* %ln3n, align 4
   %ln3p = icmp eq i16 %ln3o, 0
   br i1 %ln3p, label %n3r, label %c3q

n3r:                                              ; preds = %c2K
   %ln3x = load i32* %ln2V
   %ln3y = inttoptr i32 %ln3x to void (i32*, i32*, i32*, i32)*
   tail call cc10 void %ln3y(i32* %Base_Arg, i32* %ln2V, i32* %Hp_Arg,
i32 %R1_Arg) nounwind
   ret void

Which seems quite good to me. That compiles to the assembly:

_stg_upd_frame_info:
   subl $20, %esp
   movl %edi, 16(%esp)          # 4-byte Spill
   movl 4(%ebp), %edi
   addl $8, %ebp
   movl %esi, 8(%esp)           # 4-byte Spill
   movl %edi, %ecx
   movl %edi, %eax
   shrl $7, %ecx
   andl $-1048576, %eax         # imm = 0xFFFFFFFFFFF00000
   andl $8160, %ecx             # imm = 0x1FE0
   addl %eax, %ecx
   movl %esi, 4(%edi)
   movl $_stg_BLACKHOLE_info, (%edi)
   movswl       28(%ecx), %eax
   movl %eax, 12(%esp)          # 4-byte Spill
   testl        %eax, %eax
   je   LBB1_4
[...]
LBB1_4:                                 # %n3r
   movl (%ebp), %eax
   movl 16(%esp), %edi          # 4-byte Reload
   addl $20, %esp
   jmpl *%eax  # TAILCALL

Which is OK but still worse than the ncg. Improving this will probably
require talking to the llvm guys, I think the llvm register allocator
may have some asssumptions/design decisions that interact badly with
our calling convention.

Roman and I were a while ago investigating an issue where llvm wasn't
doing a very good job for some dph code, quite a few unnecessary
spills. We thought it was an aliasing issue but in the end seemed to
be an llvm problem with the instruction selector/scheduler creating a
lot of register pressure.

Cheers,
David


I don't expect us to do better than the NCG here, because the NCG code is
just about optimal, but I would like to use -fllvm on other parts of the RTS
code so it would be good if we could generate code that is at least as good
as the NCG here.

Cheers,
        Simon



_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc




_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc

Reply via email to