- All the spilling. In the slow path there's a foreign call (to
allocBlocks_lock)
but that is annotated with the live registers, in this case [R1], so I
don't
understand why LLVM should be spilling everything. Any ideas?
I fixed a problem like this ox x86 a month or two back. The issue is
this:
fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
// do some stuff
call foreign g() [R1];
// do some stuff
tail call f_next (Base, Hp', Sp', R1', R2, R3, R4, R5, R6);
}
Because of the use of the calling convention to pass the stg registers
around llvm thinks they are all live across the call in the code
above, since well they are. The fix is to explicitly kill the
registers that aren't live across the call, which is easy as llvm
provides a nice symbolic 'undef' value that can do just this. So above
code needs to become:
fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
// do some stuff
call foreign g() [R1];
// do some stuff
tail call f_next (Base, Hp', Sp', R1', undef, undef, undef, undef,
undef);
}
What I actually do is slightly different then this but the concept is
the same. Hope that makes sense. As I said I thought I had made this
fix a while back but hand written Cmm generally tests very different
code paths than compiler generated Cmm, so I must have missed a case.
This should be an easy fix as well.
I played around a little with compiling Updates.cmm on x86-32, will
try soon on x64. The ncg gives this assembly (annotated with the
corresponding cmm for my benefit):
_stg_upd_frame_info:
.Lcz:
movl 4(%ebp),%eax # bits32 updatee = b32[Sp +
SIZEOF_StgHeader]
movl %eax,64(%esp) # spill updatee
addl $8,%ebp # Sp = Sp + (SIZEOF_StgHeader + 4)
movl %esi,4(%eax) # gcptr[updatee + SIZEOF_StgHeader] =
R1
# prim %write_barrier() [];
movl $_stg_BLACKHOLE_info,0(%eax) # b32[updatee] = stg_BLACKHOLE_info
movl %eax,%ecx # copy updatee
andl $-1048576,%ecx # x = updatee& ((1<< 20) - 1)&
~((1<< 12) - 1)
andl $1044480,%eax # y = updatee& ~((1<< 20) - 1)
shrl $7,%eax # x>> (12 - 5)
orl %ecx,%eax # bd = x& y
cmpw $0,28(%eax) #
jne .LcA # if ( b16[bd + 28] != 0): goto .LcA
jmp *0(%ebp) # else: jump %ENTRY_CODE( bits32[Sp])
If I compile it with llvm (manually fixing the write barrier issue) I
get the following llvm code:
define cc10 void @stg_upd_frame_info(i32* noalias nocapture %Base_Arg,
i32* noalias nocapture %Sp_Arg, i32* noalias nocapture %Hp_Arg, i32
%R1_Arg) nounwind section ".text; .text 2#" align 4 {
c2K:
%ln2M = getelementptr inbounds i32* %Sp_Arg, i32 1
%ln2Q = load i32* %ln2M
%ln2V = getelementptr inbounds i32* %Sp_Arg, i32 2
%ln2Y = add i32 %ln2Q, 4
%ln30 = inttoptr i32 %ln2Y to i32*
store i32 %R1_Arg, i32* %ln30
%ln34 = inttoptr i32 %ln2Q to i32*
store i32 ptrtoint ([0 x i32]* @stg_BLACKHOLE_info to i32), i32* %ln34
%ln3c = lshr i32 %ln2Q, 7
%ln3e = and i32 %ln3c, 8160
%ln3j = and i32 %ln2Q, -1048576
%ln3k = or i32 %ln3e, %ln3j
%ln3m1 = or i32 %ln3k, 28
%ln3n = inttoptr i32 %ln3m1 to i16*
%ln3o = load i16* %ln3n, align 4
%ln3p = icmp eq i16 %ln3o, 0
br i1 %ln3p, label %n3r, label %c3q
n3r: ; preds = %c2K
%ln3x = load i32* %ln2V
%ln3y = inttoptr i32 %ln3x to void (i32*, i32*, i32*, i32)*
tail call cc10 void %ln3y(i32* %Base_Arg, i32* %ln2V, i32* %Hp_Arg,
i32 %R1_Arg) nounwind
ret void
Which seems quite good to me. That compiles to the assembly:
_stg_upd_frame_info:
subl $20, %esp
movl %edi, 16(%esp) # 4-byte Spill
movl 4(%ebp), %edi
addl $8, %ebp
movl %esi, 8(%esp) # 4-byte Spill
movl %edi, %ecx
movl %edi, %eax
shrl $7, %ecx
andl $-1048576, %eax # imm = 0xFFFFFFFFFFF00000
andl $8160, %ecx # imm = 0x1FE0
addl %eax, %ecx
movl %esi, 4(%edi)
movl $_stg_BLACKHOLE_info, (%edi)
movswl 28(%ecx), %eax
movl %eax, 12(%esp) # 4-byte Spill
testl %eax, %eax
je LBB1_4
[...]
LBB1_4: # %n3r
movl (%ebp), %eax
movl 16(%esp), %edi # 4-byte Reload
addl $20, %esp
jmpl *%eax # TAILCALL
Which is OK but still worse than the ncg. Improving this will probably
require talking to the llvm guys, I think the llvm register allocator
may have some asssumptions/design decisions that interact badly with
our calling convention.
Roman and I were a while ago investigating an issue where llvm wasn't
doing a very good job for some dph code, quite a few unnecessary
spills. We thought it was an aliasing issue but in the end seemed to
be an llvm problem with the instruction selector/scheduler creating a
lot of register pressure.
Cheers,
David
I don't expect us to do better than the NCG here, because the NCG code is
just about optimal, but I would like to use -fllvm on other parts of the RTS
code so it would be good if we could generate code that is at least as good
as the NCG here.
Cheers,
Simon
_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc