On 10 January 2012 01:52, Simon Marlow <marlo...@gmail.com> wrote: > It's great that you're now tracking the liveness of GlobalRegs, but the LLVM > code here still doesn't look as good. Any idea why? (this bit of code is > executed a *lot*, saving a few instructions here is worthwhile)
No idea at moment. I was disappointed it wasn't as good still as well after the work that went into tracking liveness. I played around just briefly with different LLVM versions and different register allocators and got some pretty different results at times. The code I posted is actually one of the better results (LLVM 2.9, greedy allocator). LLVM 3.0 with greedy allocator is slightly worse actually, deciding to spill two registers in the fast path. When I have time I'll try to play around some more. This is a nice test case since its small and important and we have an optimal benchmark to compare with so if its just LLVM being stupid we can make a fuss about it to them. > > Cheers, > Simon > > > > >> Cheers, >> David >> >> On 13 September 2010 18:49, David Terei<davidte...@gmail.com> wrote: >>> >>> OK, I've created a ticket for this issue here >>> http://hackage.haskell.org/trac/ghc/ticket/4308 >>> >>> On 13 September 2010 18:29, Simon Marlow<marlo...@gmail.com> wrote: >>>> >>>> On 13/09/2010 09:03, David Terei wrote: >>>>> >>>>> >>>>> Hi Simon, >>>>> >>>>> Hmm, I've been trying to find a nasty performance regression when ghc >>>>> is bootstrapped with llvm, this problem must be a large part of this >>>>> so thanks. >>>> >>>> >>>> Right, I expect the mfence on every update is costing quite a lot (one >>>> of >>>> those can be hundreds of cycles). >>>> >>>> Thanks for looking at this! >>>> >>>> Cheers, >>>> ` Simon >>>> >>>> >>>>>> - All the spilling. In the slow path there's a foreign call (to >>>>>> allocBlocks_lock) >>>>>> but that is annotated with the live registers, in this case [R1], >>>>>> so I >>>>>> don't >>>>>> understand why LLVM should be spilling everything. Any ideas? >>>>> >>>>> >>>>> I fixed a problem like this ox x86 a month or two back. The issue is >>>>> this: >>>>> >>>>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) { >>>>> // do some stuff >>>>> call foreign g() [R1]; >>>>> // do some stuff >>>>> tail call f_next (Base, Hp', Sp', R1', R2, R3, R4, R5, R6); >>>>> } >>>>> >>>>> Because of the use of the calling convention to pass the stg registers >>>>> around llvm thinks they are all live across the call in the code >>>>> above, since well they are. The fix is to explicitly kill the >>>>> registers that aren't live across the call, which is easy as llvm >>>>> provides a nice symbolic 'undef' value that can do just this. So above >>>>> code needs to become: >>>>> >>>>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) { >>>>> // do some stuff >>>>> call foreign g() [R1]; >>>>> // do some stuff >>>>> tail call f_next (Base, Hp', Sp', R1', undef, undef, undef, undef, >>>>> undef); >>>>> } >>>>> >>>>> What I actually do is slightly different then this but the concept is >>>>> the same. Hope that makes sense. As I said I thought I had made this >>>>> fix a while back but hand written Cmm generally tests very different >>>>> code paths than compiler generated Cmm, so I must have missed a case. >>>>> This should be an easy fix as well. >>>>> >>>>> I played around a little with compiling Updates.cmm on x86-32, will >>>>> try soon on x64. The ncg gives this assembly (annotated with the >>>>> corresponding cmm for my benefit): >>>>> >>>>> _stg_upd_frame_info: >>>>> .Lcz: >>>>> movl 4(%ebp),%eax # bits32 updatee = b32[Sp + >>>>> SIZEOF_StgHeader] >>>>> movl %eax,64(%esp) # spill updatee >>>>> addl $8,%ebp # Sp = Sp + (SIZEOF_StgHeader + 4) >>>>> movl %esi,4(%eax) # gcptr[updatee + SIZEOF_StgHeader] >>>>> = >>>>> R1 >>>>> # prim %write_barrier() []; >>>>> movl $_stg_BLACKHOLE_info,0(%eax) # b32[updatee] = stg_BLACKHOLE_info >>>>> >>>>> movl %eax,%ecx # copy updatee >>>>> andl $-1048576,%ecx # x = updatee& ((1<< 20) - >>>>> 1)& >>>>> ~((1<< 12) - 1) >>>>> andl $1044480,%eax # y = updatee& ~((1<< 20) - >>>>> 1) >>>>> shrl $7,%eax # x>> (12 - 5) >>>>> orl %ecx,%eax # bd = x& y >>>>> >>>>> cmpw $0,28(%eax) # >>>>> jne .LcA # if ( b16[bd + 28] != 0): goto >>>>> .LcA >>>>> >>>>> jmp *0(%ebp) # else: jump %ENTRY_CODE( >>>>> bits32[Sp]) >>>>> >>>>> >>>>> If I compile it with llvm (manually fixing the write barrier issue) I >>>>> get the following llvm code: >>>>> >>>>> define cc10 void @stg_upd_frame_info(i32* noalias nocapture %Base_Arg, >>>>> i32* noalias nocapture %Sp_Arg, i32* noalias nocapture %Hp_Arg, i32 >>>>> %R1_Arg) nounwind section ".text; .text 2#" align 4 { >>>>> c2K: >>>>> %ln2M = getelementptr inbounds i32* %Sp_Arg, i32 1 >>>>> %ln2Q = load i32* %ln2M >>>>> %ln2V = getelementptr inbounds i32* %Sp_Arg, i32 2 >>>>> %ln2Y = add i32 %ln2Q, 4 >>>>> %ln30 = inttoptr i32 %ln2Y to i32* >>>>> store i32 %R1_Arg, i32* %ln30 >>>>> %ln34 = inttoptr i32 %ln2Q to i32* >>>>> store i32 ptrtoint ([0 x i32]* @stg_BLACKHOLE_info to i32), i32* >>>>> %ln34 >>>>> %ln3c = lshr i32 %ln2Q, 7 >>>>> %ln3e = and i32 %ln3c, 8160 >>>>> %ln3j = and i32 %ln2Q, -1048576 >>>>> %ln3k = or i32 %ln3e, %ln3j >>>>> %ln3m1 = or i32 %ln3k, 28 >>>>> %ln3n = inttoptr i32 %ln3m1 to i16* >>>>> %ln3o = load i16* %ln3n, align 4 >>>>> %ln3p = icmp eq i16 %ln3o, 0 >>>>> br i1 %ln3p, label %n3r, label %c3q >>>>> >>>>> n3r: ; preds = %c2K >>>>> %ln3x = load i32* %ln2V >>>>> %ln3y = inttoptr i32 %ln3x to void (i32*, i32*, i32*, i32)* >>>>> tail call cc10 void %ln3y(i32* %Base_Arg, i32* %ln2V, i32* %Hp_Arg, >>>>> i32 %R1_Arg) nounwind >>>>> ret void >>>>> >>>>> Which seems quite good to me. That compiles to the assembly: >>>>> >>>>> _stg_upd_frame_info: >>>>> subl $20, %esp >>>>> movl %edi, 16(%esp) # 4-byte Spill >>>>> movl 4(%ebp), %edi >>>>> addl $8, %ebp >>>>> movl %esi, 8(%esp) # 4-byte Spill >>>>> movl %edi, %ecx >>>>> movl %edi, %eax >>>>> shrl $7, %ecx >>>>> andl $-1048576, %eax # imm = 0xFFFFFFFFFFF00000 >>>>> andl $8160, %ecx # imm = 0x1FE0 >>>>> addl %eax, %ecx >>>>> movl %esi, 4(%edi) >>>>> movl $_stg_BLACKHOLE_info, (%edi) >>>>> movswl 28(%ecx), %eax >>>>> movl %eax, 12(%esp) # 4-byte Spill >>>>> testl %eax, %eax >>>>> je LBB1_4 >>>>> [...] >>>>> LBB1_4: # %n3r >>>>> movl (%ebp), %eax >>>>> movl 16(%esp), %edi # 4-byte Reload >>>>> addl $20, %esp >>>>> jmpl *%eax # TAILCALL >>>>> >>>>> Which is OK but still worse than the ncg. Improving this will probably >>>>> require talking to the llvm guys, I think the llvm register allocator >>>>> may have some asssumptions/design decisions that interact badly with >>>>> our calling convention. >>>>> >>>>> Roman and I were a while ago investigating an issue where llvm wasn't >>>>> doing a very good job for some dph code, quite a few unnecessary >>>>> spills. We thought it was an aliasing issue but in the end seemed to >>>>> be an llvm problem with the instruction selector/scheduler creating a >>>>> lot of register pressure. >>>>> >>>>> Cheers, >>>>> David >>>>> >>>>>> >>>>>> I don't expect us to do better than the NCG here, because the NCG code >>>>>> is >>>>>> just about optimal, but I would like to use -fllvm on other parts of >>>>>> the RTS >>>>>> code so it would be good if we could generate code that is at least as >>>>>> good >>>>>> as the NCG here. >>>>>> >>>>>> Cheers, >>>>>> Simon >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Cvs-ghc mailing list >>>>> Cvs-ghc@haskell.org >>>>> http://www.haskell.org/mailman/listinfo/cvs-ghc >>>> >>>> >>>> > _______________________________________________ Cvs-ghc mailing list Cvs-ghc@haskell.org http://www.haskell.org/mailman/listinfo/cvs-ghc