Hi Simon, This is finally fixed!
LLVM now creates this code for Updates.cmm: stg_upd_frame_info: # @stg_upd_frame_info # BB#0: # %cL subq $24, %rsp movq 8(%rbp), %rdx movq %rbx, 8(%rdx) movq $stg_BLACKHOLE_info, (%rdx) movq %rdx, %rax shrq $6, %rax andq $16320, %rax # imm = 0x3FC0 movq %rdx, %rcx andq $-1048576, %rcx # imm = 0xFFFFFFFFFFF00000 orq %rax, %rcx movswq 40(%rcx), %rax addq $16, %rbp testq %rax, %rax movq %r15, %rcx je .LBB0_4 compared to the NCG which creates: stg_upd_frame_info: .LcL: movq 8(%rbp),%rax movq %rax,64(%rsp) addq $16,%rbp movq %rbx,8(%rax) movq $stg_BLACKHOLE_info,0(%rax) movq %rax,%rcx andq $-1048576,%rcx andq $1044480,%rax shrq $6,%rax orq %rcx,%rax cmpw $0,40(%rax) jne .LcH jmp *0(%rbp).LcH: See ticket: > http://hackage.haskell.org/trac/ghc/ticket/4308 for respective commits. Cheers, David On 13 September 2010 18:49, David Terei <davidte...@gmail.com> wrote: > OK, I've created a ticket for this issue here > http://hackage.haskell.org/trac/ghc/ticket/4308 > > On 13 September 2010 18:29, Simon Marlow <marlo...@gmail.com> wrote: >> On 13/09/2010 09:03, David Terei wrote: >>> >>> Hi Simon, >>> >>> Hmm, I've been trying to find a nasty performance regression when ghc >>> is bootstrapped with llvm, this problem must be a large part of this >>> so thanks. >> >> Right, I expect the mfence on every update is costing quite a lot (one of >> those can be hundreds of cycles). >> >> Thanks for looking at this! >> >> Cheers, >> ` Simon >> >> >>>> - All the spilling. In the slow path there's a foreign call (to >>>> allocBlocks_lock) >>>> but that is annotated with the live registers, in this case [R1], so I >>>> don't >>>> understand why LLVM should be spilling everything. Any ideas? >>> >>> I fixed a problem like this ox x86 a month or two back. The issue is >>> this: >>> >>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) { >>> // do some stuff >>> call foreign g() [R1]; >>> // do some stuff >>> tail call f_next (Base, Hp', Sp', R1', R2, R3, R4, R5, R6); >>> } >>> >>> Because of the use of the calling convention to pass the stg registers >>> around llvm thinks they are all live across the call in the code >>> above, since well they are. The fix is to explicitly kill the >>> registers that aren't live across the call, which is easy as llvm >>> provides a nice symbolic 'undef' value that can do just this. So above >>> code needs to become: >>> >>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) { >>> // do some stuff >>> call foreign g() [R1]; >>> // do some stuff >>> tail call f_next (Base, Hp', Sp', R1', undef, undef, undef, undef, >>> undef); >>> } >>> >>> What I actually do is slightly different then this but the concept is >>> the same. Hope that makes sense. As I said I thought I had made this >>> fix a while back but hand written Cmm generally tests very different >>> code paths than compiler generated Cmm, so I must have missed a case. >>> This should be an easy fix as well. >>> >>> I played around a little with compiling Updates.cmm on x86-32, will >>> try soon on x64. The ncg gives this assembly (annotated with the >>> corresponding cmm for my benefit): >>> >>> _stg_upd_frame_info: >>> .Lcz: >>> movl 4(%ebp),%eax # bits32 updatee = b32[Sp + >>> SIZEOF_StgHeader] >>> movl %eax,64(%esp) # spill updatee >>> addl $8,%ebp # Sp = Sp + (SIZEOF_StgHeader + 4) >>> movl %esi,4(%eax) # gcptr[updatee + SIZEOF_StgHeader] = >>> R1 >>> # prim %write_barrier() []; >>> movl $_stg_BLACKHOLE_info,0(%eax) # b32[updatee] = stg_BLACKHOLE_info >>> >>> movl %eax,%ecx # copy updatee >>> andl $-1048576,%ecx # x = updatee& ((1<< 20) - 1)& >>> ~((1<< 12) - 1) >>> andl $1044480,%eax # y = updatee& ~((1<< 20) - 1) >>> shrl $7,%eax # x>> (12 - 5) >>> orl %ecx,%eax # bd = x& y >>> >>> cmpw $0,28(%eax) # >>> jne .LcA # if ( b16[bd + 28] != 0): goto .LcA >>> >>> jmp *0(%ebp) # else: jump %ENTRY_CODE( bits32[Sp]) >>> >>> >>> If I compile it with llvm (manually fixing the write barrier issue) I >>> get the following llvm code: >>> >>> define cc10 void @stg_upd_frame_info(i32* noalias nocapture %Base_Arg, >>> i32* noalias nocapture %Sp_Arg, i32* noalias nocapture %Hp_Arg, i32 >>> %R1_Arg) nounwind section ".text; .text 2#" align 4 { >>> c2K: >>> %ln2M = getelementptr inbounds i32* %Sp_Arg, i32 1 >>> %ln2Q = load i32* %ln2M >>> %ln2V = getelementptr inbounds i32* %Sp_Arg, i32 2 >>> %ln2Y = add i32 %ln2Q, 4 >>> %ln30 = inttoptr i32 %ln2Y to i32* >>> store i32 %R1_Arg, i32* %ln30 >>> %ln34 = inttoptr i32 %ln2Q to i32* >>> store i32 ptrtoint ([0 x i32]* @stg_BLACKHOLE_info to i32), i32* %ln34 >>> %ln3c = lshr i32 %ln2Q, 7 >>> %ln3e = and i32 %ln3c, 8160 >>> %ln3j = and i32 %ln2Q, -1048576 >>> %ln3k = or i32 %ln3e, %ln3j >>> %ln3m1 = or i32 %ln3k, 28 >>> %ln3n = inttoptr i32 %ln3m1 to i16* >>> %ln3o = load i16* %ln3n, align 4 >>> %ln3p = icmp eq i16 %ln3o, 0 >>> br i1 %ln3p, label %n3r, label %c3q >>> >>> n3r: ; preds = %c2K >>> %ln3x = load i32* %ln2V >>> %ln3y = inttoptr i32 %ln3x to void (i32*, i32*, i32*, i32)* >>> tail call cc10 void %ln3y(i32* %Base_Arg, i32* %ln2V, i32* %Hp_Arg, >>> i32 %R1_Arg) nounwind >>> ret void >>> >>> Which seems quite good to me. That compiles to the assembly: >>> >>> _stg_upd_frame_info: >>> subl $20, %esp >>> movl %edi, 16(%esp) # 4-byte Spill >>> movl 4(%ebp), %edi >>> addl $8, %ebp >>> movl %esi, 8(%esp) # 4-byte Spill >>> movl %edi, %ecx >>> movl %edi, %eax >>> shrl $7, %ecx >>> andl $-1048576, %eax # imm = 0xFFFFFFFFFFF00000 >>> andl $8160, %ecx # imm = 0x1FE0 >>> addl %eax, %ecx >>> movl %esi, 4(%edi) >>> movl $_stg_BLACKHOLE_info, (%edi) >>> movswl 28(%ecx), %eax >>> movl %eax, 12(%esp) # 4-byte Spill >>> testl %eax, %eax >>> je LBB1_4 >>> [...] >>> LBB1_4: # %n3r >>> movl (%ebp), %eax >>> movl 16(%esp), %edi # 4-byte Reload >>> addl $20, %esp >>> jmpl *%eax # TAILCALL >>> >>> Which is OK but still worse than the ncg. Improving this will probably >>> require talking to the llvm guys, I think the llvm register allocator >>> may have some asssumptions/design decisions that interact badly with >>> our calling convention. >>> >>> Roman and I were a while ago investigating an issue where llvm wasn't >>> doing a very good job for some dph code, quite a few unnecessary >>> spills. We thought it was an aliasing issue but in the end seemed to >>> be an llvm problem with the instruction selector/scheduler creating a >>> lot of register pressure. >>> >>> Cheers, >>> David >>> >>>> >>>> I don't expect us to do better than the NCG here, because the NCG code is >>>> just about optimal, but I would like to use -fllvm on other parts of the >>>> RTS >>>> code so it would be good if we could generate code that is at least as good >>>> as the NCG here. >>>> >>>> Cheers, >>>> Simon >>>> >>>> >>> >>> _______________________________________________ >>> Cvs-ghc mailing list >>> Cvs-ghc@haskell.org >>> http://www.haskell.org/mailman/listinfo/cvs-ghc >> >> _______________________________________________ Cvs-ghc mailing list Cvs-ghc@haskell.org http://www.haskell.org/mailman/listinfo/cvs-ghc