Hi Simon,

This is finally fixed!

LLVM now creates this code for Updates.cmm:

stg_upd_frame_info:                     # @stg_upd_frame_info
# BB#0:                                 # %cL
   subq  $24, %rsp
   movq  8(%rbp), %rdx
   movq  %rbx, 8(%rdx)
   movq  $stg_BLACKHOLE_info, (%rdx)
   movq  %rdx, %rax
   shrq  $6, %rax
   andq  $16320, %rax            # imm = 0x3FC0
   movq  %rdx, %rcx
   andq  $-1048576, %rcx         # imm = 0xFFFFFFFFFFF00000
   orq   %rax, %rcx
   movswq   40(%rcx), %rax
   addq  $16, %rbp   testq %rax, %rax
   movq  %r15, %rcx
   je .LBB0_4

compared to the NCG which creates:

stg_upd_frame_info:
.LcL:
   movq 8(%rbp),%rax
   movq %rax,64(%rsp)
   addq $16,%rbp
   movq %rbx,8(%rax)
   movq $stg_BLACKHOLE_info,0(%rax)
   movq %rax,%rcx
   andq $-1048576,%rcx
   andq $1044480,%rax
   shrq $6,%rax
   orq %rcx,%rax
   cmpw $0,40(%rax)
   jne .LcH
   jmp *0(%rbp).LcH:

See ticket: > http://hackage.haskell.org/trac/ghc/ticket/4308 for
respective commits.

Cheers,
David

On 13 September 2010 18:49, David Terei <davidte...@gmail.com> wrote:
> OK, I've created a ticket for this issue here
> http://hackage.haskell.org/trac/ghc/ticket/4308
>
> On 13 September 2010 18:29, Simon Marlow <marlo...@gmail.com> wrote:
>> On 13/09/2010 09:03, David Terei wrote:
>>>
>>> Hi Simon,
>>>
>>> Hmm, I've been trying to find a nasty performance regression when ghc
>>> is bootstrapped with llvm, this problem must be a large part of this
>>> so thanks.
>>
>> Right, I expect the mfence on every update is costing quite a lot (one of
>> those can be hundreds of cycles).
>>
>> Thanks for looking at this!
>>
>> Cheers,
>> `       Simon
>>
>>
>>>>  - All the spilling.  In the slow path there's a foreign call (to
>>>> allocBlocks_lock)
>>>>    but that is annotated with the live registers, in this case [R1], so I
>>>> don't
>>>>    understand why LLVM should be spilling everything.  Any ideas?
>>>
>>> I fixed a problem like this ox x86 a month or two  back. The issue is
>>> this:
>>>
>>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
>>>   // do some stuff
>>>   call foreign g() [R1];
>>>   // do some stuff
>>>   tail call f_next (Base, Hp', Sp', R1', R2, R3, R4, R5, R6);
>>> }
>>>
>>> Because of the use of the calling convention to pass the stg registers
>>> around llvm thinks they are all live across the call in the code
>>> above, since well they are. The fix is to explicitly kill the
>>> registers that aren't live across the call, which is easy as llvm
>>> provides a nice symbolic 'undef' value that can do just this. So above
>>> code needs to become:
>>>
>>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
>>>   // do some stuff
>>>   call foreign g() [R1];
>>>   // do some stuff
>>>   tail call f_next (Base, Hp', Sp', R1', undef, undef, undef, undef,
>>> undef);
>>> }
>>>
>>> What I actually do is slightly different then this but the concept is
>>> the same. Hope that makes sense. As I said I thought I had made this
>>> fix a while back but hand written Cmm generally tests very different
>>> code paths than compiler generated Cmm, so I must have missed a case.
>>> This should be an easy fix as well.
>>>
>>> I played around a little with compiling Updates.cmm on x86-32, will
>>> try soon on x64. The ncg gives this assembly (annotated with the
>>> corresponding cmm for my benefit):
>>>
>>> _stg_upd_frame_info:
>>> .Lcz:
>>>   movl 4(%ebp),%eax                 # bits32 updatee = b32[Sp +
>>> SIZEOF_StgHeader]
>>>   movl %eax,64(%esp)                # spill updatee
>>>   addl $8,%ebp                      # Sp = Sp + (SIZEOF_StgHeader + 4)
>>>   movl %esi,4(%eax)                 # gcptr[updatee + SIZEOF_StgHeader] =
>>> R1
>>>                                     # prim %write_barrier() [];
>>>   movl $_stg_BLACKHOLE_info,0(%eax) # b32[updatee] = stg_BLACKHOLE_info
>>>
>>>   movl %eax,%ecx                    # copy updatee
>>>   andl $-1048576,%ecx               # x = updatee&   ((1<<  20) - 1)&
>>> ~((1<<  12) - 1)
>>>   andl $1044480,%eax                # y = updatee&  ~((1<<  20) - 1)
>>>   shrl $7,%eax                      # x>>  (12 - 5)
>>>   orl %ecx,%eax                     # bd = x&  y
>>>
>>>   cmpw $0,28(%eax)                  #
>>>   jne .LcA                          # if ( b16[bd + 28] != 0): goto .LcA
>>>
>>>   jmp *0(%ebp)                      # else: jump %ENTRY_CODE( bits32[Sp])
>>>
>>>
>>> If I compile it with llvm (manually fixing the write barrier issue) I
>>> get the following llvm code:
>>>
>>> define cc10 void @stg_upd_frame_info(i32* noalias nocapture %Base_Arg,
>>> i32* noalias nocapture %Sp_Arg, i32* noalias nocapture %Hp_Arg, i32
>>> %R1_Arg) nounwind section ".text; .text 2#" align 4 {
>>> c2K:
>>>   %ln2M = getelementptr inbounds i32* %Sp_Arg, i32 1
>>>   %ln2Q = load i32* %ln2M
>>>   %ln2V = getelementptr inbounds i32* %Sp_Arg, i32 2
>>>   %ln2Y = add i32 %ln2Q, 4
>>>   %ln30 = inttoptr i32 %ln2Y to i32*
>>>   store i32 %R1_Arg, i32* %ln30
>>>   %ln34 = inttoptr i32 %ln2Q to i32*
>>>   store i32 ptrtoint ([0 x i32]* @stg_BLACKHOLE_info to i32), i32* %ln34
>>>   %ln3c = lshr i32 %ln2Q, 7
>>>   %ln3e = and i32 %ln3c, 8160
>>>   %ln3j = and i32 %ln2Q, -1048576
>>>   %ln3k = or i32 %ln3e, %ln3j
>>>   %ln3m1 = or i32 %ln3k, 28
>>>   %ln3n = inttoptr i32 %ln3m1 to i16*
>>>   %ln3o = load i16* %ln3n, align 4
>>>   %ln3p = icmp eq i16 %ln3o, 0
>>>   br i1 %ln3p, label %n3r, label %c3q
>>>
>>> n3r:                                              ; preds = %c2K
>>>   %ln3x = load i32* %ln2V
>>>   %ln3y = inttoptr i32 %ln3x to void (i32*, i32*, i32*, i32)*
>>>   tail call cc10 void %ln3y(i32* %Base_Arg, i32* %ln2V, i32* %Hp_Arg,
>>> i32 %R1_Arg) nounwind
>>>   ret void
>>>
>>> Which seems quite good to me. That compiles to the assembly:
>>>
>>> _stg_upd_frame_info:
>>>   subl $20, %esp
>>>   movl %edi, 16(%esp)          # 4-byte Spill
>>>   movl 4(%ebp), %edi
>>>   addl $8, %ebp
>>>   movl %esi, 8(%esp)           # 4-byte Spill
>>>   movl %edi, %ecx
>>>   movl %edi, %eax
>>>   shrl $7, %ecx
>>>   andl $-1048576, %eax         # imm = 0xFFFFFFFFFFF00000
>>>   andl $8160, %ecx             # imm = 0x1FE0
>>>   addl %eax, %ecx
>>>   movl %esi, 4(%edi)
>>>   movl $_stg_BLACKHOLE_info, (%edi)
>>>   movswl       28(%ecx), %eax
>>>   movl %eax, 12(%esp)          # 4-byte Spill
>>>   testl        %eax, %eax
>>>   je   LBB1_4
>>> [...]
>>> LBB1_4:                                 # %n3r
>>>   movl (%ebp), %eax
>>>   movl 16(%esp), %edi          # 4-byte Reload
>>>   addl $20, %esp
>>>   jmpl *%eax  # TAILCALL
>>>
>>> Which is OK but still worse than the ncg. Improving this will probably
>>> require talking to the llvm guys, I think the llvm register allocator
>>> may have some asssumptions/design decisions that interact badly with
>>> our calling convention.
>>>
>>> Roman and I were a while ago investigating an issue where llvm wasn't
>>> doing a very good job for some dph code, quite a few unnecessary
>>> spills. We thought it was an aliasing issue but in the end seemed to
>>> be an llvm problem with the instruction selector/scheduler creating a
>>> lot of register pressure.
>>>
>>> Cheers,
>>> David
>>>
>>>>
>>>> I don't expect us to do better than the NCG here, because the NCG code is
>>>> just about optimal, but I would like to use -fllvm on other parts of the 
>>>> RTS
>>>> code so it would be good if we could generate code that is at least as good
>>>> as the NCG here.
>>>>
>>>> Cheers,
>>>>        Simon
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Cvs-ghc mailing list
>>> Cvs-ghc@haskell.org
>>> http://www.haskell.org/mailman/listinfo/cvs-ghc
>>
>>

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc

Reply via email to