https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120916

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
LLVM also gets execution counts wrong, just in a different (and less harmful)
way:

test:270773509:9780
 1: 9116
 2: 51984      for (
 4: 51984           i<s <- this is i<s and should also have large count
 5: 7081488         i++
 6: 7081488         a[i]++
 7: 8576
main:36431:0
 1: 0
 2.1: 9051
 3: 9278 test:9780
 4: 0

I am confused why the autofdo tool does this.  In the inner loop we output:

.L4:
        .loc 1 10 11 is_stmt 1 view .LVU8      <- a[i]
        .loc 1 10 15 is_stmt 0 view .LVU9      <- ++
        movdqa  a(%rax), %xmm0
        addq    $16, %rax
        paddd   %xmm1, %xmm0
        movaps  %xmm0, a-16(%rax)
        .loc 1 9 15 is_stmt 1 view .LVU10      <- i++
        .loc 1 8 16 view .LVU11                <- i<s
        cmpq    %rax, %rdx
        jne     .L4

Exchanging the last two .loc directives to

        .loc 1 8 16 view .LVU11                <- i<s
        .loc 1 9 15 is_stmt 1 view .LVU10      <- i++

yields:
test total:2652901 head:4123
  3: 0
  4: 4123
  5: 1322715
  6: 1322715
  7: 3348
main total:3983 head:0
  1: 0
  2.1: 1916
  3: 2067  test:1925
  4: 0
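
To compare dumps like the one above mechanically, a small parser is enough.
The following is a minimal Python sketch, not part of the autofdo tools.  It
assumes the "name total:N head:N" header form shown above, and it assumes
that a key such as 2.1 means line offset 2 with discriminator 1.
parse_profile is a hypothetical helper name.

```python
import re

def parse_profile(text):
    """Parse a text profile dump into {function: {total, head, lines}}.

    lines maps (offset, discriminator) to the sample count; a key without
    a ".N" suffix is stored with discriminator 0 (assumption of this sketch).
    """
    profile = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        head = re.match(r"(\w+) total:(\d+) head:(\d+)$", line)
        if head:
            current = {"total": int(head[2]), "head": int(head[3]), "lines": {}}
            profile[head[1]] = current
            continue
        body = re.match(r"(\d+)(?:\.(\d+))?: (\d+)", line)
        if body and current is not None:
            key = (int(body[1]), int(body[2] or 0))
            current["lines"][key] = int(body[3])
    return profile

# The dump quoted above, copied verbatim.
DUMP = """\
test total:2652901 head:4123
  3: 0
  4: 4123
  5: 1322715
  6: 1322715
  7: 3348
main total:3983 head:0
  1: 0
  2.1: 1916
  3: 2067  test:1925
  4: 0
"""

profile = parse_profile(DUMP)
for name, data in profile.items():
    print(name, data["head"], sorted(data["lines"].items()))
```

Callee counts such as test:1925 are ignored here; only the per-line counts
are kept.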

So it seems that the tool takes only the first location of the sample,
which is odd, since debug stmts may come from multiple original basic blocks
and this fact is not visible.

Ideally we could do something like:
.L4:
        .loc 1 10 11 is_stmt 1 view .LVU8      <- a[i]
        movdqa  a(%rax), %xmm0
        .loc 1 9 15 is_stmt 1 view .LVU10      <- i++
        addq    $16, %rax
        .loc 1 10 15 is_stmt 0 view .LVU9      <- ++
        paddd   %xmm1, %xmm0
        movaps  %xmm0, a-16(%rax)
        .loc 1 8 16 view .LVU11                <- i<s
        cmpq    %rax, %rdx
        jne     .L4

which would make things work (since there are no chained debug stmts), and
breakpoints would be less surprising, but I understand it is not designed to
work this way.
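
The effect of chained .loc directives can be illustrated with a toy model.
This is only a sketch in Python, not the autofdo implementation: the
addresses and sample counts are made up, and attribute() with its
first/last policy is hypothetical.  It assumes a tool that credits each
sampled address to exactly one line-table row.

```python
# Toy line tables as (address, source line) rows.  Addresses are made-up
# offsets within the loop; they are not taken from a real binary.
CHAINED = [
    (0, 10),   # a[i]
    (0, 10),   # ++   (same address as the row above)
    (23, 9),   # i++  (no instruction before the next .loc)
    (23, 8),   # i<s
]
SPLIT = [
    (0, 10),   # a[i]  movdqa
    (8, 9),    # i++   addq
    (12, 10),  # ++    paddd, movaps
    (23, 8),   # i<s   cmpq, jne
]

def attribute(rows, samples, policy):
    """Credit each sampled address to one source line.

    policy is "first" or "last": which row wins when several rows share
    the address that covers the sample (an assumption of this sketch).
    """
    counts = {}
    for addr, n in samples.items():
        # Rows covering addr are those at the greatest address <= addr.
        start = max(a for a, _ in rows if a <= addr)
        lines = [line for a, line in rows if a == start]
        line = lines[0] if policy == "first" else lines[-1]
        counts[line] = counts.get(line, 0) + n
    return counts

# Made-up samples: 100 hits on each of the six instructions.
SAMPLES = {0: 100, 8: 100, 12: 100, 16: 100, 23: 100, 26: 100}

for name, rows in (("chained", CHAINED), ("split", SPLIT)):
    for policy in ("first", "last"):
        print(name, policy, attribute(rows, SAMPLES, policy))
```

In the chained table one of lines 8 and 9 gets no samples whichever row
wins, while in the split table every line is credited and the policy does
not matter.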

LLVM does:
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        .loc    0 10 15 is_stmt 1 discriminator 33 # ll.c:10:15
        movdqa  (%rsi,%rdi), %xmm1              
        movdqa  16(%rsi,%rdi), %xmm2            
        psubd   %xmm0, %xmm1                    
        psubd   %xmm0, %xmm2                    
        movdqa  %xmm1, (%rsi,%rdi)              
        movdqa  %xmm2, 16(%rsi,%rdi)            
        .loc    0 9 15 discriminator 33         # ll.c:9:15
        addq    $32, %rsi                       
        cmpq    %rsi, %rdx                      
        jne     .LBB0_4                         

So it has only lines 9 and 10.  The large discriminator numbers seem to be
the FS discriminator encoding.  LLVM assigns discriminators twice.  The first
assignment is done similarly to ours, but scaled up.

I think it is supposed to handle the case where a statement gets duplicated
into multiple basic blocks, as a[i]++ is.  So it has:

        .loc    0 10 15 is_stmt 1 discriminator 33 # ll.c:10:15
        movdqa  (%rsi,%rdi), %xmm1              
        movdqa  16(%rsi,%rdi), %xmm2            
        psubd   %xmm0, %xmm1                    
        psubd   %xmm0, %xmm2                    
        movdqa  %xmm1, (%rsi,%rdi)              
        movdqa  %xmm2, 16(%rsi,%rdi)            

for the vectorized body and

        .loc    0 10 15 is_stmt 1               # ll.c:10:15
        leaq    (%rcx,%rdx,4), %rdi             
        incl    (%rsi,%rdi)                     

for the epilogue.  The tool has a -fuse_discriminator_encoding option, which
then merges the values back.  I will look into what this really does.
